Basics of package creation in Python
How to create and install Python packages
Introduction
Not all reusable code needs to become a large, polished package like pandas, but packaging your own functions and workflows is great for organization and reusability. Unlike R’s CRAN, Python’s PyPI does not enforce a formal review process before publishing. This makes distribution easier, but it is still the developer’s responsibility to make sure the package is usable.
As a side note, preparing to put code into a package can involve some changes to the original code:
- modularity
- putting code into functions with variables instead of hardcoding
- thinking about other potential use cases for the code blocks you write
1. Creating a basic Python package.
The Python Packaging User Guide has a more in-depth tutorial for package creation.
Using GitHub is not a requirement for creating and publishing Python packages, but I find it useful for managing my package files and documentation. It’s also possible to install Python packages directly from a GitHub repo without publishing to PyPI, which can be convenient for personal projects.
Basic requirements for a package
A minimal Python package contains:
- A project directory folder (does not need to have the package name). This will be your working directory while you build the package. This could also be your github repo folder for the package.
- A LICENSE file
- A pyproject.toml file containing the package metadata (authors, license) and build requirements
- A README.md file with a basic description of the package
- A source (‘src’) folder to contain the python scripts
- A folder (with the package name) containing an `__init__.py` file (which will become a module/subpackage)
- An optional ‘tests’ folder
For example:
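A minimal layout might look like this (assuming the package is named `mypkg`; the script name `functions.py` is illustrative):

```
project_directory/
├── LICENSE
├── pyproject.toml
├── README.md
├── src/
│   └── mypkg/
│       ├── __init__.py
│       └── functions.py
└── tests/
```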
You can leave the extra files blank for now (we’ll fill them in as we go), but it’s helpful to have the package structure created to start with.
Make sure you set your working directory to the project directory folder you made!
Now is also a good time to set up a GitHub repo using your project directory folder if you have not done so already.
What not to include
- Large datasets (unless essential)
- Secrets, credentials, API keys, or private configuration
- Temporary notebooks, experimental scripts, or build artifacts
- Virtual-environment directories (venv/, .conda/, .Rproj.user/, etc.)
Functions, Objects, Data?
In practice, we often begin with long-form exploratory scripts: Jupyter notebooks, Quarto documents, or single Python files filled with code written sequentially. Turning that into a reusable Python package requires reorganizing the code into modular functions, optional classes, and any data assets that belong with the package.
This section focuses on how to convert existing code into components that are clean, testable, and reusable inside a package.
Functions
Functions are used for repeated processes and will live in the individual Python scripts in your ‘mypkg’ folder. You can have multiple functions per script, but splitting related functions across several scripts can keep things organized.
There are many different ways to ‘function-ize’ your existing code but here are some things to start with:
- Look for repeated patterns/logical steps (ex: blocks of code you copy-pasted several times). For example:
- Loading/cleaning data
- Repeated calculations (like scores or metrics)
- Generating plots
- Data transformations
- Convert fixed values into parameters
- Things like file paths, column names, and numbers of iterations should be variables rather than hardcoded strings/numbers
- You can set helpful default values for some parameters instead of hardcoding them. This lets you run the function without defining these parameters each time unless you want to change them.
Default parameter values
Default values let you make functions flexible without forcing users to specify every argument. They work similarly to R function defaults, but Python has stricter rules about the order of parameters.
All non-default parameters must come before any parameters with default values; this ordering is required by Python’s function syntax, and violating it raises a `SyntaxError`.
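For example (a minimal sketch; the function and its names are illustrative):

```{.python}
# Valid: the required parameter comes before the one with a default
def scale(values, factor=2):
    return [v * factor for v in values]

# Invalid: a default parameter before a required one raises SyntaxError
# def scale(factor=2, values):
#     ...
```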
Defaults are evaluated once, at function definition time.
This is why you should never use mutable defaults such as `[]`, `{}`, or `pd.DataFrame()`. Use `None` instead and initialize inside the function.
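A quick sketch of the pitfall (function names are illustrative):

```{.python}
# Pitfall: the list default is created once and shared across all calls
def append_bad(item, items=[]):
    items.append(item)
    return items

# Safe pattern: default to None and create the list inside the function
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```

Calling `append_bad` twice keeps accumulating into the same shared list, while `append_good` starts fresh on each call.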
Defaults make functions easier to repurpose.
Instead of hardcoding values, expose them as parameters with sensible defaults:
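For instance, a hardcoded loader can be parameterized like this (the path and column names are illustrative):

```{.python}
import csv

# Hardcoded version: only ever reads one file and one column
# def load_scores():
#     with open("data/results.csv") as f:
#         return [float(row["score"]) for row in csv.DictReader(f)]

# Parameterized version: the defaults preserve the original behavior
def load_scores(path="data/results.csv", column="score"):
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f)]
```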
The function now works for many situations while still being simple to call.
By converting hardcoded values in your script into defaults, you maintain the original behavior while making your code reusable in a package.
Expose useful parameters of existing functions you call within your code
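One common pattern is to pass keyword arguments through to the underlying call, so callers can reach its options without you re-declaring each one (`save_json` is an illustrative name):

```{.python}
import json

def save_json(obj, path, indent=2, **dump_kwargs):
    # indent is exposed directly with a default; any other json.dump
    # option (sort_keys, default, ...) passes through via **dump_kwargs
    with open(path, "w") as f:
        json.dump(obj, f, indent=indent, **dump_kwargs)
```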
Break complex operations into multiple helper functions:
- Each helper should have one job
- Helpers can be recombined for different functionality
Example: breaking a large function into helpers
Instead of:
```{.python}
from sklearn.linear_model import LinearRegression

def process(df):
    # clean
    df = df.dropna()
    df["x"] = df["x"] * 100
    # model
    m = LinearRegression().fit(df[["x"]], df["y"])
    # format output
    preds = m.predict(df[["x"]])
    return df.assign(preds=preds)
```
Refactor:
```{.python}
from sklearn.linear_model import LinearRegression

def clean(df):
    df = df.dropna().copy()
    df["x"] = df["x"] * 100
    return df

def fit_model(df):
    model = LinearRegression().fit(df[["x"]], df["y"])
    return model

def add_predictions(df, model):
    df = df.copy()
    df["preds"] = model.predict(df[["x"]])
    return df

def process(df):
    df_clean = clean(df)
    model = fit_model(df_clean)
    return add_predictions(df_clean, model)
```
- Add input validation and informative errors. Scripts can assume everything is correct. Packages cannot. Add clear checks for types, shapes, and expected content.
Example: validating inputs
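A minimal, generic sketch (the function name and messages are illustrative; for DataFrames you would check columns and dtypes the same way):

```{.python}
def check_positive_int(value, name="value"):
    # Fail fast with a message that names the offending argument
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value
```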
Informative error messages help you debug things more effectively later!
- Keep side effects optional or isolated. Scripts often print intermediate results, create files and/or make plots. In a package, you don’t want functions creating things the user did not ask for.
- Avoid printing unless requested (use verbose argument to toggle print statements)
- Avoid writing files to disk automatically and avoid hardcoded filepaths
- Avoid plot generation within core logic (have separate plotting functions if needed)
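For example, a `verbose` flag keeps printing opt-in (a sketch; the function is illustrative):

```{.python}
def summarize(values, verbose=False):
    total = sum(values)
    if verbose:
        # Only print when the caller explicitly asks for it
        print(f"summed {len(values)} values")
    return total
```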
Objects
Not all packages require classes, but they help when your script manages ‘state’ across multiple steps (e.g., a model that moves between untrained and trained states). If your existing code stores intermediate results in many global variables or relies heavily on shared state, a class can replace that pattern.
When class-based refactoring is useful:
- The code manages an evolving object (models, pipelines, simulations)
- Several functions operate on the same internal data (these can become methods)
- You want a user-friendly API like obj.fit(), obj.predict()
- A convenient data structure doesn’t exist yet
If your script behaves like a workflow, a class can bundle the pieces:
Example: turning a multi-step workflow into a class
Instead of scattered script variables, a class can bundle the state and the steps:
```{.python}
from sklearn.linear_model import LinearRegression

class SimpleModel:
    def __init__(self, multiplier=100):
        self.multiplier = multiplier
        self.model = None

    def fit(self, df):
        df = df.dropna().copy()
        df["x"] = df["x"] * self.multiplier
        self.model = LinearRegression().fit(df[["x"]], df["y"])
        return self

    def predict(self, df):
        df = df.copy()
        df["x"] = df["x"] * self.multiplier
        return self.model.predict(df[["x"]])
```
The class-based structure tends to emerge naturally from script workflows.
Docstrings
Docstrings are function/object descriptions written in triple quotes (`""" """`) and placed directly below the function/method definition. They are the Python equivalent of Roxygen comments and are accessible via help() or IDE tooltips after the package is created. Docstrings are also used by Sphinx to auto-generate documentation (which can be published via Read the Docs with little to no editing), so they are worth investing a bit of time in.
It’s helpful to create your docstrings while you write your code, rather than after the fact. They do not have to be complicated, but at minimum should:
- Describe what the function/class does
- List parameters and return values
Choosing a consistent docstring format to follow can also help you avoid formatting frustration later on!
Click for more on docstring formats
Google style
A readable, indentation-based format. Parameters and returns are listed under clear section headers. Emphasizes simplicity.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    Args:
        a (int): First number.
        b (int): Second number.

    Returns:
        int: Sum of a and b.
    """
```
NumPy style
Uses structured section headers separated by underlines. Popular in scientific computing. Works well with Sphinx’s numpydoc extension.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    Parameters
    ----------
    a : int
        First number.
    b : int
        Second number.

    Returns
    -------
    int
        Sum of a and b.
    """
```
reStructuredText (reST)
Sphinx’s native documentation format. More verbose, uses explicit markup, and supports cross-referencing and rich structure.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    :param a: First number.
    :type a: int
    :param b: Second number.
    :type b: int
    :returns: Sum of a and b.
    :rtype: int
    """
```
In practice, all three work well with Sphinx, but Google and NumPy styles are typically easier to read and write.
MkDocs, which uses Markdown instead of reST, can be easier for some than Sphinx, but I haven’t used it before. It does not have built-in automatic API documentation like Sphinx, but mkdocstrings can fill in that part.
Data
If your script uses sample datasets for demonstration or built-in examples, these can be included in the package. The major change when going from script to package is that your code should not assume a working directory.
How to include sample data safely
Put small demonstration files under mypkg/data/.
Use importlib.resources to load data independent of the user’s environment:
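A sketch using `importlib.resources.files` (available in Python 3.9+); the package and file names are illustrative:

```{.python}
from importlib.resources import files

def load_text_resource(package, relative_path):
    # Resolve the file relative to the installed package,
    # not the user's current working directory
    return files(package).joinpath(relative_path).read_text()
```

Inside `mypkg`, a helper could then call `load_text_resource("mypkg", "data/example.csv")`.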
This avoids the common “file not found” problem from script-based code.
Those extra files
pyproject.toml
Defines project structure and metadata.
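A minimal example using the setuptools backend (the name, version, and other metadata are placeholders):

```{.toml}
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "mypkg"
version = "0.1.0"
description = "A short description of the package"
readme = "README.md"
requires-python = ">=3.9"
license = {file = "LICENSE"}
authors = [{name = "Your Name", email = "you@example.com"}]
dependencies = []
```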
tests (optional)
Unit tests verify that individual functions behave as expected on known inputs, which makes refactoring safer and catches regressions early. pytest is the most common test runner: it discovers files named test_*.py in your tests folder and runs the functions named test_* inside them. Good unit tests are small, independent, and cover both typical inputs and edge cases.
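A sketch of a pytest file (in a real package you would import the function from `mypkg` instead of defining it inline):

```{.python}
# tests/test_scale.py -- pytest collects test_* functions automatically
def scale(values, factor=2):
    # stand-in for: from mypkg.functions import scale
    return [v * factor for v in values]

def test_scale_default():
    assert scale([1, 2]) == [2, 4]

def test_scale_custom_factor():
    assert scale([1], factor=10) == [10]
```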
2. Publishing a package.
Aside: Licensing and citation
a. Putting the package on GitHub
Using source control is important, and starting with it from the beginning makes it easier to keep the package usable for your future self and others.
b. PyPI/TestPyPI
PyPI (the Python Package Index) is the central repository that pip installs from; publishing there makes your package available to anyone via pip install. TestPyPI is a separate sandbox index for practicing uploads before a real release.
c. GitHub workflows for automated package publishing
GitHub Actions workflows are YAML files in .github/workflows/ that run jobs automatically, for example on each release. A good practice is to build and publish the package to PyPI from a workflow so that releases are reproducible.
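A sketch of such a workflow using PyPI trusted publishing (the trigger and action versions are illustrative):

```{.yaml}
name: publish

on:
  release:
    types: [published]

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # required for PyPI trusted publishing
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install build
      - run: python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```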
3. Documentation
Aside: good vs. bad documentation
a. sphinx + readthedocs
Ways to create a documentation site.
Docstrings reprised
What goes in a docstring