2025-04-14
The goal of the workshop is to learn programming in Python using modern, reproducible tools and to gain the skills needed to integrate it into your own work.
Session 1
Python overview
Setting up a Python programming environment (conda, Jupyter Notebook, and VS Code IDE)
Session 2 and 3
Python basics – data structures, list comprehensions, and methods vs. functions.
pandas: data manipulation.
Session 4
Introduction to Object-oriented programming (OOP)
Example machine learning project using scikit-learn.
In this session, we will familiarize ourselves with Python and essential tools for building reproducible Python projects.
💡 Python overview: What is Python? How is it useful compared to R?
💡 Conda: conda for installing packages and creating reproducible Python virtual environments
💡 Familiarize yourself with integrated development environments (Visual Studio Code) and interactive computing tools (Jupyter Notebook)
💡 Python reproducible workflow: Create a GitHub Python repository and work in Visual Studio Code within a project-specific conda environment
Python is an open-source, high-level, interpreted, general-purpose programming language.
Key Features
Python and R differ in programming philosophy, computational capacity, and extensibility.
Feature / Task | R | Python |
---|---|---|
Programming style | Function-oriented | Mostly object-oriented |
Scope of functionality | Primarily for statistical computing and data analysis | General-purpose: ML & AI, software development, scripting, etc. |
Computational power | ✅ Vectorized operations; ⚠️ Memory-intensive for large data and loops | ✅ Faster loop performance; ✅ Strong GPU support; ✅ Efficient memory use |
Package ecosystem | ✅ Rich ecosystem of statistical tools; ⚠️ Fewer ML/AI tools | ☑️ Emerging statistical packages; ✅ Extensive ML/AI tools |
While R is the go-to tool for statistical analysis, Python has caught up with many equivalent libraries and functions:
statsmodels / scipy.stats provide regression modeling and hypothesis testing.
scikit-survival / lifelines support survival analysis and plotting.
Python dominates in machine learning and AI development:
scikit-learn is a comprehensive machine learning library that supports supervised regression and classification (e.g., random forests, gradient boosting) and unsupervised clustering (e.g., K-means) analyses.
TensorFlow / PyTorch are deep learning libraries widely used for computer vision and natural language processing (NLP).
optuna / Ray can be integrated into ML/DL workflows for easy and efficient model training, hyperparameter tuning, fine-tuning, etc.
Emerging bioinformatics packages provide standard omics data preprocessing and analysis pipelines:
scanpy, anndata are libraries for single-cell RNA-seq data loading, preprocessing, and analysis.
Biopython is a set of tools for biological computation that performs file parsing, sequence analysis, clustering algorithms, etc.
pysam works with raw input files (e.g., BAM/SAM/VCF).
Let’s familiarize ourselves with the essential software tools for Python programming and project building.
conda, a powerful package and environment manager
Conda is a cross-platform, multi-language command line interface (CLI) for managing packages and environments.
Anaconda Distribution provides the Anaconda Navigator application, which lets you manage packages and environments without having to use the command line.
An environment is an isolated, self-contained workspace that includes its own language interpreter and package dependencies.
Example:
In this case, creating virtual environments allows these Python installations to operate fully independently and not interfere with each other.
You may find the flexibility of environments beneficial in many cases.
.yml file.
conda commands
Let’s walk through some useful commands.
💡Open Anaconda Prompt (Windows) or Terminal (macOS) to execute commands
You can install, update, and remove packages from specific channels (package inventories).
To install packages from the default Anaconda channel:
When installing multiple packages, conda resolves dependencies across them.
To install packages from another channel (e.g., conda-forge):
Note: you can use -c (shorthand) and --channel interchangeably.
—
| conda install | pip install |
---|---|---|
Package types | Python + non-Python | Python-only |
Package source | Conda channels | Python Package Index (PyPI) |
Dependency resolution | ✅ Comprehensive | ⚠️ Limited, no cross-checks |
Conda cannot track packages installed by pip. Using conda and pip back-to-back can overwrite and potentially break existing packages.
💡Best practice: When working in a conda environment, install everything with conda first, then use pip only when the package is not available in conda.
Check out this article for more information on using pip in a conda environment.
To update a package:
This automatically updates the package to the highest version supported by the current Python series.
You can create, activate, update, export, and remove virtual environments with conda.
To create a conda virtual environment:
Note: you can use -n (shorthand) and --name interchangeably.
Avoid activating an environment on top of another virtual environment!
Always run conda deactivate before activating another one, because environments can be stacked, which can potentially break both environments.
💡 Tip: make sure you see (base) at the beginning of the terminal prompt when activating an environment.
Conda environments can be exported and shared as .yml files for reproducibility:
Example generic .yml:
Another way to quickly export information about packages and their versions:
This will save an environment.yml file to your current working directory.
Note: --no-builds removes build information from dependencies for simplicity.
Now, you can create the environment with:
We will later create an environment for the workshops.
👉 Download environment.yml
An Integrated Development Environment (IDE) is a suite of tools contained in a software application, which typically includes:
An IDE brings together everything you need to write and run code and manage projects.
Visual Studio Code (VS Code) is one of the most popular open-source code editors with many features.
Jupyter Notebook
Jupyter is an interactive computing tool that lets you combine executable code, Markdown text, and visual components in one file called a notebook (.ipynb).
VS Code + Jupyter notebooks for Python = RStudio + R Markdown for R.
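To make the idea of "code and Markdown in one file" concrete, here is a minimal sketch of what a notebook looks like on disk. An .ipynb file is plain JSON with a list of cells, each tagged with a cell type. The cell contents below are hypothetical and only for illustration; in practice VS Code or Jupyter writes this file for you.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell and one code cell.
# The cell contents are made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Session 1 notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello, workshop!')"],
        },
    ],
}

# Serialize to the text that would be stored in the .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)  # ['markdown', 'code']
```

Because the file is plain text, Git can track changes to a notebook like any other project file.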
Git is a version control system that tracks changes in code and project files over time.
GitHub is a cloud-based platform that hosts Git repositories for sharing and collaboration.
Note: Git and GitHub are not the same. Git is the software tool for version control and can be used without GitHub, while the latter provides the online hosting service.
Why use Git for projects?
Git works seamlessly with VS Code and Jupyter Notebooks
README.md and .gitignore (choose Python template)
Alternatively…
Go to practice repository: https://github.mskcc.org/Python-Workshop/workshop_proj
Fork the repo to create a copy on your GitHub account
Now, you should have a forked copy of the repository on your own GitHub account (e.g., username/workshop_proj).
Organize the project with separate data/ and notebooks/ folders and an environment.yml file to save virtual environment information.
Here is an example project structure.
👉 Download the workshop environment.yml file and add it to the parent directory.
Note: If you do not use GitHub – that is fine! You can skip the previous step of cloning from GitHub and directly create a local folder with subfolders and files.
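For readers who skip GitHub and build the folder by hand, the layout described above could also be created from Python. This is a minimal sketch using only the standard library. The helper name create_project_skeleton is hypothetical, the demo runs in a temporary directory, and the environment.yml it creates is an empty placeholder rather than the workshop file.

```python
from pathlib import Path
import tempfile


def create_project_skeleton(root):
    """Create data/ and notebooks/ folders plus an empty environment.yml placeholder."""
    root = Path(root)
    for folder in ("data", "notebooks"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Placeholder only: replace it with the downloaded workshop environment.yml.
    (root / "environment.yml").touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())


# Demonstrate in a temporary directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_skeleton(Path(tmp) / "workshop_proj")
    print(created)  # ['data', 'environment.yml', 'notebooks']
```

To use it on a real project, pass the path of your own project folder instead of the temporary one.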
Note: Ensure that Miniconda is installed. On macOS, ensure conda is correctly initialized with conda init (see installing Miniconda).
Launch Anaconda Prompt (Windows) or Terminal (macOS).
Change directory to your project folder with cd.
Create the environment with:
(Ctrl+Shift+X or Cmd+Shift+X)
Select Open Folder… from the VS Code Welcome page:
Or by selecting File > Open Folder (Ctrl+K Ctrl+O)
Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P)
Select Python: Select Interpreter
Choose the interpreter associated with the conda environment we created: Python 3.10.13 (python-intro-env)
If the environment does not appear, click Enter interpreter path…
C:\Users\<username>\miniconda3\envs\myenv\python.exe
/Users/<username>/miniconda3/envs/myenv/bin/python
Create a Jupyter Notebook with Create: New Jupyter Notebook from the Command Palette (Ctrl+Shift+P) or by creating a new .ipynb file from the left-hand Explorer panel
Select a Python interpreter by clicking Select Kernel in the upper-right corner
Choose the appropriate conda environment: python-intro-env (Python 3.10.13)
Now, create cells with the desired cell types: Python (default); Markdown; …
Test kernel selection with:
This should return the local path to the Python executable associated with your conda environment.
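The kernel check described above can be sketched as follows. This assumes the original cell printed sys.executable, which is the standard-library way to get the path of the running interpreter.

```python
import sys

# Path of the Python executable that is running this kernel or script.
# With the workshop environment selected, it should point inside that
# environment's folder (for example, a path containing "envs").
interpreter_path = sys.executable
print(interpreter_path)
```

If the printed path points to a different Python installation, reselect the kernel and run the cell again.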
Try importing packages installed in our environment:
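A minimal sketch of such a check is shown below. The package list is an assumption: pandas and scikit-learn are named in the workshop outline, but the actual contents of the workshop environment.yml may differ. The helper is_installed is hypothetical and reports missing packages instead of raising an error.

```python
import importlib.util

# Assumed package list (taken from the workshop outline, not from environment.yml).
packages = ["pandas", "sklearn"]  # scikit-learn is imported under the name "sklearn"


def is_installed(name):
    """Return True if the current interpreter can find the top-level package."""
    return importlib.util.find_spec(name) is not None


for name in packages:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A package reported as missing usually means the notebook is running on a different kernel than the workshop environment.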
You are all set to start coding in Python 👏
Q: Can I create virtual environments inside project folders rather than in the default location?
Yes, you can create the environment somewhere other than the default location (e.g., ~/miniconda3/envs).
You would specify the path when creating the environment:
Cons: Might lead to slower performance and permission issues (e.g., saving to H Drive)
Q: Can I create one env per project? Would too many envs be a problem?
When a project is finished, export environment information and remove it from disk:
This saves a snapshot of the project dependencies for reproducibility!
Questions?