Python Installation, Environments, and VS Code Setup

Python Group

2025-04-14

Helpful Resources


Workshop website: Introduction to Python Workshops
👉Find session handouts, assignments, and recordings
👉Ask questions (GitHub Discussion page)

Books / Websites:
Python for Data Analysis, 3rd edition, by Wes McKinney is a helpful reference for Python beginners. The book focuses on data analysis and introduces relevant packages, including NumPy, pandas (the author is one of its creators!), statsmodels, and scikit-learn.

Prerequisite
  • MSK laptop/Remote Desktop
  • MSK Github Enterprise account
  • Programming experience and minimal command line skills

    Workshop Overview

    The goal of the workshop is to learn programming in Python using modern, reproducible tools and to gain the skills needed to integrate it into your own work.

    Session 1

    • Python overview

    • Setting up a Python programming environment (conda, Jupyter Notebook, and VS Code IDE)

    Session 2 and 3

    • Python basics: data structures, list comprehensions, and methods vs. functions.

    • pandas: data manipulation.

    Session 4

    • Introduction to Object-oriented programming (OOP)

    • Example machine learning project using scikit-learn.

    Session 1 Learning Goals

    In this session, we will get familiar with Python and the essential tools for building reproducible Python projects.

    💡 Python overview: What is Python? How is it useful compared to R?
    💡 Conda: conda for installing packages and creating reproducible Python virtual environments
    💡 IDEs and notebooks: get familiar with an integrated development environment (Visual Studio Code) and an interactive computing tool (Jupyter Notebook)
    💡 Python reproducible workflow: Create a GitHub Python repository and work in Visual Studio Code within a project-specific conda environment

    Introduction to Python

    What is Python?


    Python is an open-source, high-level, interpreted, general-purpose programming language

    Key Features

    • Easy to use: Python syntax is designed to be readable and user-friendly
    • General-purpose: widely applied in data science, ML/AI, software development, automation, etc.
    • Interpreted: run code without compiling (unlike C/C++)
    • Open-source: free to use and supported by a large user community that contributes many high-quality, publicly available packages
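
    As a small illustration of the readable syntax and the no-compile workflow described above, the sketch below counts words in a sentence using only the standard library. The function name word_counts and the sample sentence are made up for this example.

```python
from collections import Counter


def word_counts(sentence):
    """Return a Counter mapping each lowercase word to its frequency."""
    words = sentence.lower().split()
    return Counter(words)


# Hypothetical sample input, chosen only for illustration.
counts = word_counts("the quick brown fox jumps over the lazy dog")
print(counts.most_common(1))  # the most frequent word and its count
```

    Saving this as a .py file and running it with the python command executes it directly; there is no separate compile step.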

    Python vs. R

    Python and R differ in programming philosophy, computational capacity, and extensibility.

    Feature / Task         | R                                                      | Python
    Programming style      | Function-oriented                                      | Mostly object-oriented
    Scope of functionality | Primarily for statistical computing and data analysis  | General-purpose: ML & AI, software development, scripting, etc.
    Computational power    | ✅ Vectorized operations;                              | ✅ Faster loop performance;
                           | ⚠️ Memory-intensive for large data and loops           | ✅ Strong GPU support;
                           |                                                        | ✅ Efficient memory use
    Package ecosystem      | ✅ Rich ecosystem of statistical tools;                | ☑️ Emerging statistical packages;
                           | ⚠️ Fewer ML/AI tools                                   | ✅ Extensive ML/AI tools

    Python Package Ecosystem

    Statistical analysis

    While R is the go-to tool for statistical analysis, Python has caught up with many equivalent libraries and functions:

    • statsmodels / scipy.stats provide regression modeling and hypothesis testing.
    • scikit-survival / lifelines support survival analysis and plotting.

    ML/DL ecosystem

    Python dominates in machine learning and AI development:

    • scikit-learn is a comprehensive machine learning library that supports supervised regression and classification (e.g., random forests, gradient boosting) and unsupervised clustering (e.g., K-means) analyses
    • TensorFlow / PyTorch are deep learning libraries widely used for computer vision and natural language processing (NLP).
    • optuna / Ray can be integrated into ML/DL workflows for easy and efficient model training, hyperparameter tuning, fine-tuning, etc.

    Omics data analysis

    Emerging bioinformatics packages provide standard omics data preprocessing and analysis pipelines:

    • scanpy, anndata are libraries for single-cell RNA-seq data loading, preprocessing, and analysis
    • Biopython is a set of tools for biological computation that supports file parsing, sequence analysis, clustering algorithms, etc.
    • pysam works with raw input files (e.g., BAM/SAM/VCF)

    Essential Tools

    Let’s get familiar with the essential software tools for Python programming and project building.

    • Miniconda: provides Python installation and conda, a powerful package and environment manager
    • Visual Studio Code: a lightweight code editor that integrates programming + plots + terminal + git + …
    • Jupyter Notebook: an interactive computing tool that combines code execution + markdown + visualizations
    • Git (GitHub): for version control, collaboration, and publishing code

    Conda

    Conda is a cross-platform, multi-language command line interface (CLI) for managing packages and environments.

  • Multi-language: supports both Python and non-Python packages (e.g., R, C/C++)
  • Environment management: creates virtual environments with specific Python versions and dependencies for reproducibility.

    Miniconda vs. Anaconda

    You can install conda with either of two installers:
  • Miniconda (minimal version)
  • Anaconda Distribution (full version)

    The latter provides the Anaconda Navigator application, which lets you manage packages and environments without using the command line.

    Conda Virtual Environments

    An environment is an isolated, self-contained workspace that includes its own language interpreter and package dependencies.

    Example:
  • Global environment (system): Miniconda installation of Python 3.12 + dependencies
  • Virtual environment1: Python 3.8 + dependencies
  • Virtual environment2: R 4.4.2 + dependencies

    In this case, the virtual environments allow these installations to operate fully independently without interfering with each other.
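
    Because each environment carries its own interpreter, you can ask Python itself which environment it is running from. The following is a minimal sketch using only the standard library; the helper name describe_environment is hypothetical, and the folder name is only a guess at the environment name (for a conda environment it usually matches).

```python
import sys
from pathlib import Path


def describe_environment():
    """Collect basic facts about the interpreter running this code."""
    return {
        "executable": sys.executable,  # path to the Python binary in use
        "prefix": sys.prefix,          # root folder of the active environment
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "env_name": Path(sys.prefix).name,  # folder name, often the env name
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

    Running the same script from two different environments should print two different prefixes, which is a quick way to confirm that they are isolated.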

    Why Use Virtual Environments?


    You may find the flexibility of environments beneficial in many cases.

    • Avoid Conflicts. Different projects often need different, incompatible versions of the same dependency. Changes made to one environment won’t affect other projects that use different environments.
    • Easy Management. Reduce the risk of breaking system Python and globally installed packages. You can easily delete a virtual environment if issues occur and recreate it.
    • Reproducibility. Environments work as time capsules, allowing you to replicate the requirements of a project at a later time or on a new machine.
    • Sharing Environments. Allow sharing the Python version and entire list of dependencies with other people through a copy of the .yml file.

    Overview of conda commands

    Let’s walk through some useful commands.

    💡Open Anaconda Prompt (Windows) or Terminal (macOS) to execute commands

    Installing Packages with Conda

    You can install, update, and remove packages from specific channels (package repositories).

    To install packages from the default Anaconda channel:

    # Install a single package
    conda install scipy
    
    # Install a specific version of a package
    conda install scipy=0.15.0 
    
    # Install multiple packages
    conda install scipy=0.15.0 pandas matplotlib

    When installing multiple packages, Conda resolves dependencies across them

    To install packages located in another channel (e.g., conda-forge):

    conda install conda-forge::pytorch 
    # or 
    conda install pytorch --channel conda-forge
    
    # Include multiple channels for package search
    conda install pytorch -c conda-forge -c bioconda

    Note: you can use -c (shorthand) and --channel interchangeably.
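
    After installing, it can be handy to confirm from Python which version actually ended up in the environment. The sketch below uses the standard-library importlib.metadata module; the helper name installed_version is an assumption for this example, and the package names simply mirror the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ["scipy", "pandas", "matplotlib"]:
    print(name, installed_version(name) or "not installed")
```

    The result reflects the interpreter that runs the snippet, so run it inside the environment you want to inspect.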


    Conda install vs. Pip install

    Feature               | conda install       | pip install
    Package types         | Python + non-Python | Python-only
    Package source        | Conda channels      | Python Package Index (PyPI)
    Dependency resolution | ✅ Comprehensive    | ⚠️ Limited, no cross-checks

    Conda does not manage packages installed by pip. Using conda and pip back-to-back can overwrite and potentially break existing packages.

    💡Best practice: When working in a conda environment, install everything with conda first, then use pip only when the package is not available in conda.
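
    To see how mixed an environment already is, you can inspect the INSTALLER file that installers may record in each package's metadata. This is a rough sketch, not a conda feature: pip writes its own name there and conda-built packages often record conda, but the file can be missing, so treat the output as a hint. The function name installers is hypothetical.

```python
from collections import Counter
from importlib.metadata import distributions


def installers():
    """Map each installed distribution to the tool named in its INSTALLER file."""
    result = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken metadata
        recorded = dist.read_text("INSTALLER")  # None when the file is missing
        tool = recorded.strip() if recorded else ""
        result[name] = tool or "unknown"
    return result


# Summarize how many packages each tool installed in this environment.
print(Counter(installers().values()))
```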

    Check out this article for more information on using pip in a conda environment.

    Updating and Removing Packages with Conda

    To update a package:

    conda update scipy

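    This updates the package to the latest version compatible with the rest of the environment, including its Python version.

    • Example. For Python itself, conda update python stays within the current major series: Python 3.9 updates to the highest version available in the 3.x series.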

    To remove a package:

    # To remove a package
    conda remove scipy 
    
    # To remove multiple packages at once
    conda remove scipy pandas matplotlib

    Creating an Environment with Conda

    You can create, activate, update, export, and remove virtual environments with conda.

    To create a conda virtual environment:

    conda create --name myenv

    Note: you can use -n (shorthand) and --name interchangeably.


    To create an environment with a specific Python version and install packages to it:

    conda create -n myenv python=3.10
    conda activate myenv
    conda install jupyter ipykernel matplotlib -c conda-forge

    or do it in one line:

    conda create -n myenv python=3.10 jupyter ipykernel matplotlib -c conda-forge

    Creating an Environment with Conda (Cont.)

    To activate an environment:

    conda activate myenv

    To deactivate the current environment (no need to specify the name):

    conda deactivate

    Avoid activating one virtual environment on top of another!
    Always run conda deactivate before activating another environment: environments can be stacked, which can potentially break both of them.
    💡Tip: make sure you see (base) at the beginning of the terminal prompt before activating an environment.

    To remove an environment entirely:

    conda env remove -n myenv

    You can look up the names and locations of all the environments on your computer with:

    conda env list
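
    If you want the same listing inside a script, conda env list accepts a --json flag whose output contains an "envs" list of paths. The sketch below only parses such output; the helper name parse_env_list and the sample paths are made up for illustration.

```python
import json
from pathlib import Path


def parse_env_list(json_text):
    """Turn the JSON from `conda env list --json` into {folder name: path}."""
    paths = json.loads(json_text).get("envs", [])
    return {Path(path).name: path for path in paths}


# Hypothetical sample output; real paths depend on your installation.
sample = '{"envs": ["/Users/me/miniconda3", "/Users/me/miniconda3/envs/myenv"]}'
for name, path in parse_env_list(sample).items():
    print(f"{name}: {path}")
```

    To use real data, pass in the text returned by subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True).stdout. Note that the base environment shows up under its installation folder name rather than as "base".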

    Creating Environments from YML Files

    Conda environments can be exported and shared as .yml files for reproducibility:

    Example generic .yml:

    name: python310
    channels:
      - defaults
    dependencies:
      - python=3.10
      - pandas
      - numpy
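
    The file layout above is simple enough to generate from a script, for example when you keep a project's dependency list in code. The following is a minimal sketch that builds the same text with plain string handling; render_environment_yml is a hypothetical helper, not part of conda.

```python
def render_environment_yml(name, channels, dependencies):
    """Build the text of a minimal environment.yml file."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


text = render_environment_yml(
    "python310", ["defaults"], ["python=3.10", "pandas", "numpy"]
)
print(text)
```

    Writing the returned text to environment.yml gives a file that conda env create -f environment.yml can read.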

    Instead of writing the file by hand, you can export the packages and versions of an existing environment:

    conda activate myenv
    conda env export --no-builds > environment.yml

    This will save an environment.yml file to your current working directory.
    Note: --no-builds removes build information from dependencies for simplicity.
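
    To make the effect of --no-builds concrete: a fully exported dependency has the form name=version=build, and the flag keeps only the first two parts. The sketch below imitates that on plain strings; it illustrates the idea rather than conda's own implementation, and the build string is invented.

```python
def strip_build(spec):
    """Drop the build string from a 'name=version=build' dependency spec."""
    parts = spec.split("=")
    return "=".join(parts[:2])


# Hypothetical specs; the build string is made up for illustration.
specs = ["numpy=1.26.4=py310h0000000_0", "pandas=2.2.1", "python"]
print([strip_build(spec) for spec in specs])
```

    Build strings are often platform-specific, so dropping them tends to make the file easier to reuse on another operating system.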

    Now, you can create the environment from a .yml file and activate it using the name given in the file:

    conda env create -f environment.yml
    conda activate python310

    We will later create an environment for the workshops. 👉Download environment.yml

    Integrated Development Environment

    An Integrated Development Environment (IDE) is a suite of tools contained in a software application, which typically includes:

    • A source code editor
    • A compiler or interpreter to execute code
    • A built-in debugger
    • Environment management and version control systems for development workflows

    An IDE brings together everything you need to write and run code and manage projects.

    Visual Studio Code


    Visual Studio Code (VS Code) is one of the most popular open-source code editors with many features.

    • Lightweight
    • Multi-language support for Python, R, C++, etc.
    • Integrated Git version control
    • Extensible features like Jupyter notebooks, Quarto, remote connection

    Jupyter Notebook

    Jupyter is an interactive computing tool that lets you combine executable code, Markdown text, and visual components in one file called a notebook (.ipynb).


    VS Code + Jupyter notebooks for Python = RStudio + RMarkdowns for R.

    Git (GitHub)


    Git is a version control system that tracks changes in code and project files over time.
    GitHub is a cloud-based platform that hosts Git repositories for sharing and collaboration.

    Note: Git and GitHub are not the same. Git is the software tool for version control and can be used without GitHub, while the latter provides the online hosting service.

    Why use Git for projects?

    • Track and commit changes to files in a repository
    • Revert or compare previous versions when something breaks
    • Work in parallel on branches without overwriting each other’s work
    • Back up your research projects in a centralized location
    • Share code for publication purposes

    Git works seamlessly with VS Code and Jupyter Notebooks

    Python Reproducible Workflow


    Prerequisite
  • Install Conda (Miniconda)
  • Install VS Code
  • GitHub account (to practice using GitHub version control for projects)
  • (optional) Install GitHub Desktop

    Workflow
    • Create a GitHub repository and clone it to your local computer
    • Organize the project directory with a clean structure
    • Create a project-specific conda virtual environment
    • Work in VS Code with the corresponding Python environment

    Create a GitHub Project - Example

    • Create a new repository on GitHub (public) or GitHub Enterprise (MSK Secured)
    • Initialize with README.md and .gitignore (choose Python template)

    Alternatively…

    • Now, clone the repository to your local computer using one of the following options:

    • Option 1 (CLI): Copy the HTTPS URL and clone with git command

      git clone https://github.mskcc.org/<username>/workshop_proj.git
    • Option 2 (GUI): Open in GitHub Desktop

    Organize the Project Directory

    Organize the project with separate data/ and notebooks/ folders and an environment.yml file that records the virtual environment.

    • Here is an example project structure.

      workshop_proj/
         ├── data/            # Raw & processed data folder
         ├── notebooks/       # Jupyter notebooks and scripts
         ├── environment.yml  # Conda environment YML file
         ├── .gitignore       # Files that git should not track
         └── README.md        # Project description 
    • 👉Download the workshop environment.yml file and add it to the top-level project directory.

    Note: If you do not use GitHub – that is fine! You can skip the previous step of cloning from GitHub and directly create a local folder with subfolders and files.
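
    If you build the folder locally, the example structure can be created with a few lines of Python. This is a sketch using pathlib; create_project is a hypothetical helper, the files it creates are empty placeholders, and the demo writes into a temporary folder so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def create_project(root):
    """Create the example folders and empty placeholder files under root."""
    root = Path(root)
    for folder in ("data", "notebooks"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in ("environment.yml", ".gitignore", "README.md"):
        (root / filename).touch(exist_ok=True)  # keeps existing content intact
    return sorted(path.name for path in root.iterdir())


# Demo in a throwaway location; pass your own path for a real project.
with tempfile.TemporaryDirectory() as tmp:
    print(create_project(Path(tmp) / "workshop_proj"))
```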

    Create a Project-Specific Conda Environment

    Note: Ensure that Miniconda is installed. On macOS, ensure conda is correctly initialized with conda init (see installing Miniconda)

    1. Launch Anaconda Prompt (Windows) or Terminal (macOS).

    2. Change directory to your project folder with cd

    3. Create the environment with:

      conda env create -f environment.yml
      conda activate python-intro-env

    Set up VS Code for Python & Jupyter

    1. Launch VS Code
    2. Open the Extensions Marketplace (Ctrl+Shift+X or Cmd+Shift+X)
    3. Install the following extensions:
      • Python (auto-completion, debugging, and interpreter selection)
      • Jupyter (notebook interface and interactive execution)
      • Note: You still need a local installation of Python for the extensions to work.

    Work in Jupyter Notebooks in VS Code

    1. Open your project folder in VS Code:
      • Select Open Folder… from the VS Code Welcome page

      • Or select File > Open Folder (Ctrl+K Ctrl+O)

    2. Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P)

    3. Select Python: Select Interpreter

    4. Choose the interpreter associated with the conda environment we created: Python 3.10.13 (python-intro-env)

      • If the environment does not appear, click Enter interpreter path… and enter the path to the environment's Python executable, for example:

        • Windows: C:\Users\<username>\miniconda3\envs\python-intro-env\python.exe
        • macOS: /Users/<username>/miniconda3/envs/python-intro-env/bin/python

    5. Create a Jupyter Notebook with Create: New Jupyter Notebook from the Command Palette (Ctrl+Shift+P) or by creating a new .ipynb file from the Explorer panel on the left

    6. Select a Python interpreter by clicking Select Kernel in the upper right corner

    7. Choose the appropriate conda environment: python-intro-env (Python 3.10.13)

    8. Now, create cells with the desired cell type: Python (default), Markdown, …

    9. Test kernel selection with:

      import sys
      print(sys.executable)

      This should return the local path to the Python executable associated with your conda environment.

      Try importing packages installed in our environment:

      import pandas
      import sklearn
      import great_tables
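
    If one of these imports fails, it helps to know exactly which packages the selected kernel cannot see. The sketch below checks a list of module names without importing them, using importlib.util.find_spec from the standard library; missing_packages is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["pandas", "sklearn", "great_tables"])
print("All packages found" if not missing else f"Missing: {missing}")
```

    A non-empty list usually means either the wrong kernel is selected or the environment was not created from the workshop environment.yml.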

    You are all set to start coding in Python 👏

    🔥Tips:
    Reproducibility & Environments

    Q: Can I create virtual environments inside project folders rather than in the default location?

    • Yes, you can create an environment somewhere other than the default location (e.g., ~/miniconda3/envs).

    • Specify the path when creating the environment:

      conda env create --prefix </path/to/your/proj/env>
    • Cons: Might lead to slower performance and permission issues (e.g., saving to H Drive)

    Q: Can I create one env per project? Would too many envs be a problem?

    • Yes, it is recommended that you create project-specific environments!
    • An environment normally takes ~200 MB to 1 GB. If you are low on disk space, keeping too many environments might be an issue.
    • Solution: Clean up unused environments from finished projects.
      • When a project is finished, export environment information and remove it from disk:

        conda env export --no-builds > environment.yml
        conda env remove -n <old-env-name>
      • This saves a snapshot of the project dependencies for reproducibility!
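
    To decide which environments are worth cleaning up, you can measure how much disk space each one uses. This is a rough sketch that sums file sizes with pathlib; folder_size_mb is a hypothetical helper, and the envs location is assumed to be the default ~/miniconda3/envs mentioned above, so adjust it if your installation differs.

```python
from pathlib import Path


def folder_size_mb(folder):
    """Sum the sizes of all regular files under folder, in megabytes."""
    total = sum(
        path.stat().st_size
        for path in Path(folder).rglob("*")
        if path.is_file() and not path.is_symlink()
    )
    return total / 1_000_000


envs_dir = Path.home() / "miniconda3" / "envs"  # assumed default location
if envs_dir.is_dir():
    for env in sorted(envs_dir.iterdir()):
        if env.is_dir():
            print(f"{env.name}: {folder_size_mb(env):.0f} MB")
else:
    print(f"No environments folder found at {envs_dir}")
```

    Conda may hard-link files shared between environments, so the summed sizes can overstate the space you would actually reclaim.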

    Takeaways

    • Python is built around objects with associated attributes and methods. This modular design makes Python programs easy to extend with new features.
    • Conda virtual environments help isolate Python versions across projects and automatically resolve dependency conflicts within each environment.
    • Use Git/GitHub for version control, project back up, code sharing, and collaboration.
    • VS Code allows integration of Jupyter notebooks, conda virtual environments, and automatic Git file tracking.


    Questions?