Basics of package creation in Python
How to create and install Python packages
Introduction
Not all reusable code needs to become a large, polished package like pandas, but packaging your own functions and workflows is great for organization and reusability. Unlike R’s CRAN, Python’s PyPI does not enforce a formal review process before publishing. This makes distribution easier, but it is still the developer’s responsibility to make sure the package is usable.
As a side note, preparing to put code into a package can involve some changes to the original code:
- modularity
- putting code into functions with variables instead of hardcoding
- thinking about other potential use cases for the code blocks you write
1. Creating a basic Python package.
The Python Packaging User Guide has a more in-depth tutorial for package creation.
Using GitHub is not a requirement for creating and publishing Python packages, but I find it useful for managing my package files and documentation. It’s also possible to install Python packages directly from a GitHub repo without publishing to PyPI, which can be convenient for personal projects.
Basic requirements for a package
A minimal Python package contains:
- A project directory folder (does not need to have the package name). This will be your working directory while you build the package. This could also be your github repo folder for the package.
- A LICENSE file
- A pyproject.toml file containing the package metadata (authors, license) and build requirements
- A README.md file with a basic description of the package
- A source (‘src’) folder to contain the python scripts
- A folder (with the package name) containing an `__init__.py` file (which will become a module/subpackage)
- An optional ‘tests’ folder
For example:
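A minimal layout might look like this (assuming the package is named `mypkg`; the script name `functions.py` is illustrative):

```
project_directory/
├── LICENSE
├── pyproject.toml
├── README.md
├── src/
│   └── mypkg/
│       ├── __init__.py
│       └── functions.py
└── tests/
```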
You can leave the extra files blank for now (we’ll fill them in as we go), but it’s helpful to have the package structure created to start with.
Make sure you set your working directory to the project directory folder you made!
Now is also a good time to set up a GitHub repo using your project directory folder if you have not done so already.
What not to include
- Large datasets (unless essential)
- Secrets, credentials, API keys, or private configuration
- Temporary notebooks, experimental scripts, or build artifacts
- Virtual-environment directories (venv/, .conda/, .Rproj.user/, etc.)
Functions, Objects, Data?
In practice, we often begin with long-form exploratory scripts: Jupyter notebooks, Quarto documents, or single Python files filled with code written sequentially. Turning that into a reusable Python package requires reorganizing the code into modular functions, optional classes, and any data assets that belong with the package.
This section focuses on how to convert existing code into components that are clean, testable, and reusable inside a package.
Functions
Functions are used for repeated processes and will live in the individual Python scripts in your ‘mypkg’ folder. You can have multiple functions per script, but splitting related functions across several scripts can keep things organized.
There are many different ways to ‘function-ize’ your existing code but here are some things to start with:
- Look for repeated patterns/logical steps (ex: blocks of code you copy-pasted several times). For example:
- Loading/cleaning data
- Repeated calculations (like scores or metrics)
- Generating plots
- Data transformations
- Convert fixed values into parameters
- Things like file paths, column names, and numbers of iterations should be variables rather than hardcoded strings/numbers
- You can set helpful default values for some parameters instead of hardcoding them. This lets you run the function without defining these parameters each time unless you want to change them.
Default parameter values
Default values let you make functions flexible without forcing users to specify every argument. They work similarly to R function defaults, but Python has stricter rules about the order of parameters.
All non-default parameters must come before any parameters with default values; this ordering is required by Python’s function syntax, and violating it raises a `SyntaxError`.
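For example (a minimal sketch; the function and its names are illustrative):

```{.python}
# Valid: the required parameter comes before the one with a default
def scale(values, factor=2):
    return [v * factor for v in values]

# Invalid: a default parameter before a required one raises SyntaxError
# def scale(factor=2, values):
#     ...
```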
Defaults are evaluated once, at function definition time.
This is why you should never use mutable defaults such as `[]`, `{}`, or `pd.DataFrame()`. Use `None` instead and initialize inside the function.
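A quick sketch of the pitfall (function names are illustrative):

```{.python}
# Pitfall: the list default is created once and shared across all calls
def append_bad(item, items=[]):
    items.append(item)
    return items

# Safe pattern: default to None and create the list inside the function
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```

Calling `append_bad` twice keeps accumulating into the same shared list, while `append_good` starts fresh on each call.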
Defaults make functions easier to repurpose.
Instead of hardcoding values, expose them as parameters with sensible defaults:
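For instance, a hardcoded loader can be parameterized like this (the path and column names are illustrative):

```{.python}
import csv

# Hardcoded version: only ever reads one file and one column
# def load_scores():
#     with open("data/results.csv") as f:
#         return [float(row["score"]) for row in csv.DictReader(f)]

# Parameterized version: the defaults preserve the original behavior
def load_scores(path="data/results.csv", column="score"):
    with open(path, newline="") as f:
        return [float(row[column]) for row in csv.DictReader(f)]
```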
The function now works for many situations while still being simple to call.
By converting hardcoded values in your script into defaults, you maintain the original behavior while making your code reusable in a package.
Expose useful parameters of existing functions you call within your code
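One common pattern is to pass keyword arguments through to the underlying call, so callers can reach its options without you re-declaring each one (`save_json` is an illustrative name):

```{.python}
import json

def save_json(obj, path, indent=2, **dump_kwargs):
    # indent is exposed directly with a default; any other json.dump
    # option (sort_keys, default, ...) passes through via **dump_kwargs
    with open(path, "w") as f:
        json.dump(obj, f, indent=indent, **dump_kwargs)
```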
Break complex operations into multiple helper functions:
- Each helper should have one job
- Helpers can be recombined for different functionality
Example: breaking a large function into helpers
Instead of:
```{.python}
from sklearn.linear_model import LinearRegression

def process(df):
    # clean
    df = df.dropna()
    df["x"] = df["x"] * 100
    # model
    m = LinearRegression().fit(df[["x"]], df["y"])
    # format output
    preds = m.predict(df[["x"]])
    return df.assign(preds=preds)
```
Refactor:
```{.python}
from sklearn.linear_model import LinearRegression

def clean(df):
    df = df.dropna().copy()
    df["x"] = df["x"] * 100
    return df

def fit_model(df):
    model = LinearRegression().fit(df[["x"]], df["y"])
    return model

def add_predictions(df, model):
    df = df.copy()
    df["preds"] = model.predict(df[["x"]])
    return df

def process(df):
    df_clean = clean(df)
    model = fit_model(df_clean)
    return add_predictions(df_clean, model)
```
- Add input validation and informative errors. Scripts can assume everything is correct. Packages cannot. Add clear checks for types, shapes, and expected content.
Example: validating inputs
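A minimal, generic sketch (the function name and messages are illustrative; for DataFrames you would check columns and dtypes the same way):

```{.python}
def check_positive_int(value, name="value"):
    # Fail fast with a message that names the offending argument
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value
```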
Informative error messages help you debug things more effectively later!
- Keep side effects optional or isolated. Scripts often print intermediate results, create files and/or make plots. In a package, you don’t want functions creating things the user did not ask for.
- Avoid printing unless requested (use verbose argument to toggle print statements)
- Avoid writing files to disk automatically and avoid hardcoded filepaths
- Avoid plot generation within core logic (have separate plotting functions if needed)
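For example, a `verbose` flag keeps printing opt-in (a sketch; the function is illustrative):

```{.python}
def summarize(values, verbose=False):
    total = sum(values)
    if verbose:
        # Only print when the caller explicitly asks for it
        print(f"summed {len(values)} values")
    return total
```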
Objects
Not all packages require classes, but they help when your script manages ‘state’ across multiple steps (e.g., a model that moves between untrained and trained states). If your existing code stores intermediate results in many global variables or relies heavily on shared state, a class can replace that pattern.
When class-based refactoring is useful:
- The code manages an evolving object (models, pipelines, simulations)
- Several functions operate on the same internal data (these can become methods)
- You want a user-friendly API like obj.fit(), obj.predict()
- A convenient data structure doesn’t exist yet
If your script behaves like a workflow, a class can bundle the pieces:
Example: turning a multi-step workflow into a class
Instead of scattered script variables, a class can bundle the state and the steps:
```{.python}
from sklearn.linear_model import LinearRegression

class SimpleModel:
    def __init__(self, multiplier=100):
        self.multiplier = multiplier
        self.model = None

    def fit(self, df):
        df = df.dropna().copy()
        df["x"] = df["x"] * self.multiplier
        self.model = LinearRegression().fit(df[["x"]], df["y"])
        return self

    def predict(self, df):
        df = df.copy()
        df["x"] = df["x"] * self.multiplier
        return self.model.predict(df[["x"]])
```
The class-based structure tends to emerge naturally from script workflows.
Docstrings
Docstrings are function/object descriptions written in triple quotes (`""" """`) and placed directly below the function/method definition. They are the Python equivalent of Roxygen comments and are accessible via help() or IDE tooltips after the package is created. Docstrings are also used by Sphinx to auto-generate documentation (which can be published via Read the Docs with little to no editing), so they are worth investing a bit of time in.
It’s helpful to create your docstrings while you write your code, rather than after the fact. They do not have to be complicated, but at minimum should:
- Describe what the function/class does
- List parameters and return values
Choosing a consistent docstring format to follow can also help you avoid formatting frustration later on!
Click for more on docstring formats
Google style
A readable, indentation-based format. Parameters and returns are listed under clear section headers. Emphasizes simplicity.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    Args:
        a (int): First number.
        b (int): Second number.

    Returns:
        int: Sum of a and b.
    """
```
NumPy style
Uses structured section headers separated by underlines. Popular in scientific computing. Works well with Sphinx’s numpydoc extension.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    Parameters
    ----------
    a : int
        First number.
    b : int
        Second number.

    Returns
    -------
    int
        Sum of a and b.
    """
```
reStructuredText (reST)
Sphinx’s native documentation format. More verbose, uses explicit markup, and supports cross-referencing and rich structure.
Example:
```{.python}
def func(a, b):
    """Adds two numbers.

    :param a: First number.
    :type a: int
    :param b: Second number.
    :type b: int
    :returns: Sum of a and b.
    :rtype: int
    """
```
In practice, all three work well with Sphinx, but Google and NumPy styles are typically easier to read and write.
MkDocs, which uses Markdown instead of reST, can be easier for some than Sphinx, but I haven’t used it before. It does not have built-in automatic API documentation like Sphinx, but mkdocstrings can fill in that part.
Data
If your script uses sample datasets for demonstration or built-in examples, these can be included in the package. The major change when going from script to package is that your code should not assume a working directory.
How to include sample data safely
Put small demonstration files under mypkg/data/.
Use importlib.resources to load data independent of the user’s environment:
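A sketch using `importlib.resources.files` (available in Python 3.9+); the package and file names are illustrative:

```{.python}
from importlib.resources import files

def load_text_resource(package, relative_path):
    # Resolve the file relative to the installed package,
    # not the user's current working directory
    return files(package).joinpath(relative_path).read_text()
```

Inside `mypkg`, a helper could then call `load_text_resource("mypkg", "data/example.csv")`.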
This avoids the common “file not found” problem from script-based code.
Those extra files
pyproject.toml
Defines project structure and metadata.
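A minimal example using the setuptools backend (the name, version, and other metadata are placeholders):

```{.toml}
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "mypkg"
version = "0.1.0"
description = "A short description of the package"
readme = "README.md"
requires-python = ">=3.9"
license = {file = "LICENSE"}
authors = [{name = "Your Name", email = "you@example.com"}]
dependencies = []
```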
tests (optional)
Unit tests verify that individual functions behave as expected on known inputs, which makes refactoring safer and catches regressions early. pytest is the most common test runner: it discovers files named test_*.py in your tests folder and runs the functions named test_* inside them. Good unit tests are small, independent, and cover both typical inputs and edge cases.
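A sketch of a pytest file (in a real package you would import the function from `mypkg` instead of defining it inline):

```{.python}
# tests/test_scale.py -- pytest collects test_* functions automatically
def scale(values, factor=2):
    # stand-in for: from mypkg.functions import scale
    return [v * factor for v in values]

def test_scale_default():
    assert scale([1, 2]) == [2, 4]

def test_scale_custom_factor():
    assert scale([1], factor=10) == [10]
```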
2. Publishing a package.
Aside: Licensing and citation
a. Putting the package on GitHub
Using source control is important, and starting with it from the beginning makes it easier to keep the package usable for your future self and others.
b. PyPI/TestPyPI
PyPI (the Python Package Index) is the central repository that pip installs from; publishing there makes your package available to anyone via pip install. TestPyPI is a separate sandbox index for practicing uploads before a real release.
c. GitHub workflows for automated package publishing
GitHub Actions workflows are YAML files in .github/workflows/ that run jobs automatically, for example on each release. A good practice is to build and publish the package to PyPI from a workflow so that releases are reproducible.
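A sketch of such a workflow using PyPI trusted publishing (the trigger and action versions are illustrative):

```{.yaml}
name: publish

on:
  release:
    types: [published]

jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # required for PyPI trusted publishing
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install build
      - run: python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
```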
3. Documentation
Aside: good vs. bad documentation
a. sphinx + readthedocs
Ways to create a documentation site.
Docstrings reprised
What goes in a docstring