Session 4 – Object-Oriented Programming and Modeling Libraries

Session Overview

In this session, we’ll explore how Python’s object-oriented nature affects our modeling workflows.

Topics:

Intro to OOP and how it makes modeling in Python different from R
Building and extending classes using inheritance and mixins
Applying OOP to machine learning through demos with scikit-learn

Creating and using models
Plotting data with plotnine and seaborn

Introduction

Why Python? 🐍

R: Built by Statisticians for Statisticians

Excels at:
- Statistical analysis and modeling
- Clean outputs and tables from models
- Beautiful data visualizations with simple code

Python: General-Purpose Language

Excels at:
- Machine Learning, Neural Networks & Deep Learning (scikit-learn, PyTorch, TensorFlow)
- Image & Genomic Data Analysis (scikit-image, biopython, scanpy)
- Software & Command Line Interfaces, Web Scraping, Automation

Python’s broader ecosystem makes it the go-to language in domains like AI, bioinformatics, data engineering, and computational biology.

Note: Packages like rpy2 and reticulate make it possible to use both R and Python in the same project, but those are beyond the scope of this course.
A primer on reticulate is available here: https://www.r-bloggers.com/2022/04/getting-started-with-python-using-r-and-reticulate/

Programming Styles: R vs Python

In the first session, we talked briefly about functional vs object-oriented programming:

Functional programming: focuses on functions as the primary unit of code
Object-oriented programming: uses objects with attached attributes(data) and methods(behaviors)

R leans heavily on the functional paradigm — you pass data into functions and get back results, in most cases without altering the original data. In R, we also use pipes (|>, %>%) to chain functions together.
In Python, everything is an object, even basic things like lists, strings, and dataframes. A lot of ‘functions’ are instead written as object-associated methods. Some of these methods modify the objects in-place by altering their attributes. Understanding how this works is key to using Python effectively!

You’ve already seen this object-oriented style in Sessions 2 and 3 — you create objects like lists or dataframes, then call methods on them like .append() or .sort_values(). In python, instead of piping, we sometimes chain methods together.

Modeling in Python

Python absolutely uses functions—just like R! They’re helpful for data transformation, wrangling, and automation tasks, like looping and parallelization.

But when it comes to modeling, libraries are designed around classes: blueprints for creating objects that store data (attributes) and define behaviors (methods).

scikit-learn is great for getting started— everything follows a simple, consistent OOP interface. Its API is also consistant with other modeling packages, like xgboost and scvi-tools.
scikit-survival is built on top of scikit-learn. https://scikit-survival.readthedocs.io/en/stable/user_guide/00-introduction.html is a good tutorial for it.
PyTorch and TensorFlow are essential if you go deeper into neural networks or custom models— you’ll define your own model classes with attributes and methods, but the basic structure is similar to scikit-learn.
statsmodels is an alternative to scikit-learn for statistical analyses and has R-like syntax and outputs. It’s a bit more complex than scikit-learn and a bit less consistant with other packages in the python ecosystem. https://wesmckinney.com/book/modeling is a good tutorial for statsmodels.

💡 To work effectively in Python, especially for tasks involving modeling or model training, it helps to think in terms of objects and classes, not just functions.

Even though python is more object focused, it still uses functions! They are particualrly helpful for things like data transformation, data wrangling and automation tasks like looping and parallelization. However, when it comes to modeling, libraries are designed around classes, which are like blueprints for creating objects that store data and define behaviors.

These are some of the most commonly used modeling libraries in python. For this session, we will focus on scikit-learn, which is relatively simple and follows a consistent interface. Its API (Application Programming Interface) is also consistant with other modeling packages like xgboost and scvi-tools. scikit-survival is built off of scikit-learn and therefore has a similar API.

Pytorch and tensorflow are more complex but essential for neural networks/deep learning models. With these packages, you define your own model classes, but the basic class structure is similar to sklearn.

Finally, there is statsmodels, which is an alternative to sklearn for statistical analyses. It has R-like syntax and outputs and is a bit more complicated to use than sklearn. Personally, if I wanted R-like outputs I’d use reticulate or just save my data and re-load in R, but statsmodels is available if needed.

Why Does OOP Matter in Python Modeling?

In Python modeling frameworks:

Models are instances of classes
You call methods like .fit(), .predict(), .score()
Internal model details like coefficients or layers are stored as attributes

This makes model behavior consistent between model classes and even libraries. It also simplifies creating/using pre-trained models: both the architecture and learned weights are bundled into a single object with expected built-in methods like .predict() or .fine_tune().

Instead of having a separate results object, like in R, you would retrieve your results by accessing an attribute or using a method that is attached to the model object itself.

We’ll focus on scikit-learn in this session, but these ideas carry over to other libraries like xgboost, statsmodels, and PyTorch.

So why is OOP so important for python modeling? First, models in python are instances of classes that have attached methods like fit, predict and score. The internal model details like coefficients or layers are stored as attributes.

Definining models as instances of classes is useful because it makes model behavior consistant between model classes (due to inheritance which i’ll explain more about shortly) and even libraries. It also simplifies creating/using pre-trained models because the model architecture and learned weights are bundled into a single object that can be loaded. Expected built-in methods are also included in this bundle.

Additionally, instead of having a separate results object like in R, you can retrieve your results by either accessing an attribute or using a method attached to the model object itself.

Part 1: Object-Oriented Programming

Key OOP Principles (Recap)

In OOP, code is structured around objects (as opposed to functions). This paradigm builds off the following principles:

Encapsulation: Bundling data and methods together in a single unit.
- A StandardScaler object stores mean and variance data and has .fit() and .transform() methods

Inheritance: Creating new classes based on existing ones.
- sklearn.LinearRegression inherits attributes and methods from a general regression model class.

Abstraction: Hiding implementation details and exposing only essential functionality.
- e.g., .fit() works the same way from the outside, regardless of model complexity

Polymorphism: Objects of different types can be treated the same way if they implement the same methods.
- Python’s duck typing:
  - 🦆 “If it walks like a duck and quacks like a duck, then it must be a duck.” 🦆
  - ex: If different objects all have a .summarize() method, we can loop over them and call .summarize() without needing to check their class. As long as the method exists, Python will know what to do.
  - This lets us easily create pipelines that can work for many types of models.

We won’t cover pipelines here, but they are worth looking into!

If you read through the OOP pre-reading this will be a review. In OOP, code is structured around objects and this paradigm builds off of the following principles: 1: Encapsulation - bundling data and methods together in a single unit. For example, an object of the Standard Scalar class, which we’ll use later, stores mean and variance data and has fit and transform methods. 2: Inheritance - which allows us to create new classes that inherit attributes and methods from existing classes. For example, the lienar regression model class in sklearn inherits attributes and methods from a general regression model class 3: Abstraction - hiding implementation detals and exposing only essential functionality For example, the fit method works the same way from the outside regardless of model complexity 4: Polymorphism means we can treat objects of different types the same way—as long as they implement the same method.

In Python, this is done through something called duck typing, which comes from the phrase: “If it walks like a duck and quacks like a duck, it’s probably a duck.” In other words, Python doesn’t care what class an object is—it just checks whether it has the method you’re trying to call.

So if three different objects each have a .summarize() method, we can loop through them and call .summarize() without needing to check their type first. As long as the method exists and accepts the right arguments, Python will just run it. This flexibility is one of the biggest reasons OOP is so useful for machine learning. Most model classes in Python—whether it’s logistic regression, random forests, or a neural network—come with a standard set of methods like .fit(), .predict(), and .score(). That consistency means we can swap models in and out with very little change to our code. It also makes it easy to build reusable code—like pipelines—that can work across different model types without needing special handling for each one.

Classes and Objects

Classes are blueprints for creating objects. Each object contains:

Attributes (data): model coefficients, class labels
Methods (behaviors): .fit(), .predict()

👉 To Get the class of an object, use:

type(object) # Returns the type of the object

👉 To check if an object is an instance of a particular class, use:

isinstance(object, class)  # Returns True if `object` is an instance of `class`.

Knowing what class an object belongs to helps us understand what methods and attributes it provides.

Example: Creating a Class

Base Classes

A base class (or parent class) serves as a template for creating objects. Other classes can inherit from it to reuse its properties and methods.

Classes are defined using the class keyword, and their structure is specified using an __init__() method for initialization.

For example, we can define a class called Dog and give it attributes that store data about a given dog and methods that represent behaviors an object of the Dog class can perform. We can also edit the special or “dunder” methods (short for double underscore) that define how objects behave in certain contexts.

class Dog: ## begin class definition
    def __init__(self, name, breed): ## define init method
        self.name = name ## add attributes
        self.breed = breed

    def speak(self): ## add methods
        return f"{self.name} says woof!"

    def __str__(self): # __str__(self) tells python what to display when an object is printed
        return f"Our dog {self.name}"

    def __repr__(self): # add representation to display when dog is called in console
        return f"Dog(name={self.name!r}, breed={self.breed!r})"

A base class serves as a template for creating objects and other classes can inherit its attributes and methods. Classes are defined using the ‘class’ keyword and their structure is specified using an init method for initialization.

For example, we can create a class called Dog and give it attributes and methods. We can also edit special or dunder methods that define how objects of this class behave in certain contexts.

First we will define our init method, which sets up the structure for objects of the dog class. The parameters here (name and breed) can be stored as attributes. The speak method here defines the speech behavior for the dog and returns a string, that changes based on the value stored in the ‘name’ attribute.

We can also define our str and repr methods which tell python what to do when an object is printed (str) or called in console (repr). These methods are useful for displaying information about objects at a glance. We will see an interesting example of a repr method later on.

Creating a dog

Creating an instance of the Dog class lets us model a particular dog:

buddy = Dog("Buddy", "Golden Retriever")
print(f"Buddy is an object of class {type(buddy)}")

Buddy is an object of class <class '__main__.Dog'>

We set the value of the attributes [name and breed], which are then stored as part of the buddy object
We can use any methods defined in the Dog class on buddy

## if we want to see what kind of dog our dog is
## we can call buddy's attributes
print(f"Our dog {buddy.name} is a {buddy.breed}.")

## we can also call any Dog methods
print(buddy.speak())  

## including special methods
buddy ## displays what was in the __repr__() method

Our dog Buddy is a Golden Retriever.
Buddy says woof!

Dog(name='Buddy', breed='Golden Retriever')

Note: For python methods, the self argument is assumed to be passed and therefore we do not put anything in the parentheses when calling .speak(). For attributes, we do not put () at all.

Derived (Child) Classes

Derived/child classes build on base classes using the principle of inheritence.

Now that we have a Dog class, we can build on it to create a specialized GuardDog class.

class GuardDog(Dog):  # GuardDog inherits from Dog
    def __init__(self, name, breed, training_level): ## in addition to name and breed, we can 
        # define a training level. 
        # Call the parent (Dog) class's __init__ method
        super().__init__(name, breed)
        self.training_level = training_level  # New attribute for GuardDog that stores the 
        # training level for the dog

    def guard(self): ## checks if the training level is > 5 and if not says train more
        if self.training_level > 5:
            return f"{self.name} is guarding the house!"
        else:
            return f"{self.name} needs more training before guarding."
    
    def train(self): # modifies the training_level attribute to increase the dog's training level
        self.training_level = self.training_level + 1
        return f"Training {self.name}. {self.name}'s training level is now {self.training_level}"

# Creating an instance of GuardDog
rex = GuardDog("Rex", "German Shepherd", training_level= 5)

Derived or child classes build on base classes using the principle of inheritence. This helps us build more specialized classes without having to start from scratch.

If we want to make a new guardDog class, we can start by inheriting the attributes and methods of the dog class by putting ‘Dog’ in the parentheses here.

When we define our init method, we can first call the parent class’s init method to assign the dog’s name and breed. We can also add an additional attribute that stores the dog’s training level.

Because of inheritence, guarddogs also have the speak method, but we can also give them new methods like guard and train. The guard method, like the speak method earlier, just returns a value. However, the train method actually modifies the training_level attribute of the object in-place. This means that, if we want to train a guarddog, we do not have to assign the trained dog to a new variable to preserve the trained state.

Let’s make an instance of guarddog called rex and we can see how this works.

Now that we have a dog (rex), we can call on any of the methods/attributes introduced in the Dog class as well as the new GuardDog class.

Using methods from the base class:

print(rex.speak())
rex

Rex says woof!

Dog(name='Rex', breed='German Shepherd')

This is the power of inheritance—we don’t have to rewrite everything from scratch!

Using a method from the child class:

print(f"{rex.name}'s training level is {rex.training_level}.")
print(rex.guard())

Rex's training level is 5.
Rex needs more training before guarding.

Unlike standalone functions, methods in Python often update objects in-place—meaning they modify the object itself rather than returning a new one.

We can use the .train() method to increase rex’s training level.

print(rex.train())

Training Rex. Rex's training level is now 6

Be Careful!!!

Calling rex.train() within a print statement still updates rex’s training level. If we were to do this instead:

rex.train()
print(rex.train())

it would train rex twice!

Now if we check,

print(f"{rex.name}'s training level is {rex.training_level}.")
print(rex.guard())

Rex's training level is 6.
Rex is guarding the house!

As with Rex, child classes inherit all attributes (.name and .breed) and methods (.speak() __repr__()) from parent classes. They can also have new methods (.train()).

Mixins

A mixin is a special kind of class designed to add functionality to another class. Unlike base classes, mixins aren’t used alone.

For example, scikit-learn uses mixins like:
- sklearn.base.ClassifierMixin (adds classifier-specific methods)
- sklearn.base.RegressorMixin (adds regression-specific methods)

which it adds to the BaseEstimator class to add functionality.

To finish up our dog example, we are going to define a mixin class that adds learning tricks to the base Dog class and use it to create a new class called SmartDog.

Mixins area special kind of class that take advantage of the inheritence property to add functionality to other classes without needing to re-define the initial class structure. Unlike base classes, mixins are not useful on their own.

Some examples of mixin classes are ClassifierMixin and RegressorMixin from scikit learn, which add functionality to the BaseEstimator class. To finish up our dog example, we are going to define a mixin class that adds learning tricks to the base Dog class and use it to create a new class called SmartDog.

When creating a mixin class, we let the other base classes carry most of the initialization

class TrickMixin: ## mixin that will let us teach a dog tricks
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # Ensures proper initialization in multi inheritance
        self.tricks = []  # Add attribute to store tricks

## add trick methods
    def learn_trick(self, trick):
        """Teaches the dog a new trick."""
        if trick not in self.tricks:
            self.tricks.append(trick)
            return f"{self.name} learned a new trick: {trick}!"
        return f"{self.name} already knows {trick}!"

    def perform_tricks(self):
        """Returns a list of tricks the dog knows."""
        if self.tricks:
            return f"{self.name} can perform: {', '.join(self.tricks)}."
        return f"{self.name} hasn't learned any tricks yet."

## note: the TrickMixin class is not a standalone class!

By including both Dog and TrickMixin as base classes, we give objects of class SmartDog the ability to speak and learn tricks!

class SmartDog(Dog, TrickMixin):
    def __init__(self, name, breed):
        super().__init__(name, breed)  # Initialize Dog class
        TrickMixin.__init__(self)  # Initialize TrickMixin separately

# a SmartDog object can use methods from both parent object `Dog` and mixin `TrickMixin`.
my_smart_dog = SmartDog("Buddy", "Border Collie")
print(my_smart_dog.speak())

Buddy says woof!

print(my_smart_dog.learn_trick("Sit"))  
print(my_smart_dog.learn_trick("Roll Over")) 
print(my_smart_dog.learn_trick("Sit"))

Buddy learned a new trick: Sit!
Buddy learned a new trick: Roll Over!
Buddy already knows Sit!

print(my_smart_dog.perform_tricks())

Buddy can perform: Sit, Roll Over.

Duck Typing

🦆 “If it quacks like a duck and walks like a duck, it’s a duck.” 🦆

Python’s duck typing makes our lives a lot easier, and is one of the main benefits of methods over functions:

Inheritence - objects inherit methods from base classes
Repurposing old code - methods by the same name work the same for different model types
Use methods without checking types - methods are assumed to work on the object they’re attached to

We can demonstrate duck typing by defining two new base classes that are different than Dog but also have a speak() method.

class Human:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} says hello!"

class Parrot:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} says squawk!"

Duck Typing in Action

Even though Dog, Human and Parrot are entirely different classes…

def call_speaker(obj):
    print(obj.speak())

call_speaker(Dog("Fido", "Labrador"))
call_speaker(Human("Alice"))
call_speaker(Parrot("Polly"))

Fido says woof!
Alice says hello!
Polly says squawk!

They all implement .speak(), so Python treats them the same!

In the context of our work, this would allow us to make a pipeline using models from different libraries that have the same methods.

While our dog example was very simple, this is the same way that model classes work in python!

Warning

With duck typing, Python lets us use methods without breaking. It does not mean that any given method is correct to use in all cases, or that all similar objects will have the same methods.

Example: OOP in Machine Learning and Modeling

Machine learning models in Python are implemented as classes.

When you create a model, you’re instantiating an object of a predefined class (e.g., LogisticRegression()).
That model has attributes (parameters, coefficients) and methods (like .fit() and .predict()).

For example LogisticRegression is a model class that inherits from SparseCoefMixin and BaseEstimator.

class LogisticRegression(LinearClassifierMixin, SparseCoefMixin, BaseEstimator):

To perform logistic regression, we create an instance of the LogisticRegression class.

## Example: 
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # Creating an instance of the LogisticRegression class
model.fit(X_train, y_train)   # Calling a method to train the model
predictions = model.predict(X_test)  # Calling a method to make predictions
coefs = model.coef_ # Access model coefficients using attribute

Key Benefits of OOP in Machine Learning

Encapsulation – Models store parameters and methods inside a single object.
Inheritance – New models can build on base models, reusing existing functionality.
Abstraction – .fit() should work as expected, regardless of complexity of underlying implimentation.
Polymorphism (Duck Typing) – Different models share the same method names (.fit(), .predict()), making them easy to use interchangeably, particularly in analysis pipelines.

Understanding base classes and mixins is especially important when working with deep learning frameworks like PyTorch and TensorFlow, which require us to create our own model classes.

Part B - Demo Projects

Apply knowledge of OOP to modeling using scikit-learn

🐧 Mini Project: Classifying Penguins with scikit-learn

Now that you understand classes and data structures in Python, let’s apply that knowledge to classify penguin species using two features:

bill_length_mm
bill_depth_mm

We’ll explore:

Unsupervised learning with K-Means clustering (model doesn’t ‘know’ y)
Supervised learning with a k-NN classifier (model trained w/ y information)

All scikit-learn models are designed to have

Common Methods:

.fit() — Train the model
.predict() — Make predictions

Common Attributes:

.classes_, .n_clusters_, etc.

This is true of the scikit-survival package too!

Import Libraries

Before any analysis, we must import the necessary libraries.

For large libraries like scikit-learn, PyTorch, or TensorFlow, we usually do not import the entire package. Instead, we selectively import the classes and functions we need.

Classes
- StandardScaler — for feature scaling
- KNeighborsClassifier — for supervised k-NN classification
- KMeans — for unsupervised clustering

🔤 Naming Tip:
- CamelCase = Classes
- snake_case = Functions

Functions
- train_test_split() — to split data into training and test sets
- accuracy_score() — to evaluate classification accuracy
- classification_report() — to print precision, recall, F1 (balance of precision and recall), Support (number of true instances per class) - adjusted_rand_score() — to evaluate clustering performance

Import Libraries

## imports
import pandas as pd
import numpy as np

from plotnine import *
import seaborn as sns
import matplotlib.pyplot as plt

from great_tables import GT

## sklearn imports

## import classes
from sklearn.preprocessing import StandardScaler 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

## import functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, adjusted_rand_score

Data Preparation

# Load the Penguins dataset
penguins = sns.load_dataset("penguins").dropna()

# Make a summary table for the penguins dataset, grouping by species. 
summary_table = penguins.groupby("species").agg({
    "bill_length_mm": ["mean", "std", "min", "max"],
    "bill_depth_mm": ["mean", "std", "min", "max"],
    "sex": lambda x: x.value_counts().to_dict()  # Count of males and females
})

# Round numeric values to 1 decimal place (excluding the 'sex' column)
for col in summary_table.columns:
    if summary_table[col].dtype in [float, int]:
        summary_table[col] = summary_table[col].round(1)

# Display the result
display(summary_table)

	bill_length_mm				bill_depth_mm				sex
	mean	std	min	max	mean	std	min	max	<lambda>
species
Adelie	38.8	2.7	32.1	46.0	18.3	1.2	15.5	21.5	{'Male': 73, 'Female': 73}
Chinstrap	48.8	3.3	40.9	58.0	18.4	1.1	16.4	20.8	{'Female': 34, 'Male': 34}
Gentoo	47.6	3.1	40.9	59.6	15.0	1.0	13.1	17.3	{'Male': 61, 'Female': 58}

Data Visualization

To do visualization, we can use either seaborn or plotnine. plotnine mirrors ggplot2 syntax from R and is great for layered grammar-of-graphics plots, while seaborn is more convienient if you want to put multiple plots on the same figure.

Plotting with Plotnine vs Seaborn

Plotnine (like ggplot2 in R)
The biggest differences between plotnine and ggplot2 syntax are:

With plotnine the whole call is wrapped in () parentheses
Variables are called with strings ("" are needed!)
If you don’t use from plotnine import *, you will need to import each individual function you plan to use!

Seaborn (base matplotlib + enhancements)

Designed for quick, polished plots
Works well with pandas DataFrames or NumPy arrays
Integrates with matplotlib for customization
Good for things like decision boundaries or heatmaps
Harder to customize than plotnine plots

Scatterplot with plotnine

To take a look at the distribution of our species by bill length and bill depth before clustering…

plot1 = (ggplot(penguins, aes(x="bill_length_mm", y="bill_depth_mm", color="species"))
 + geom_point()
 + ggtitle("Penguin Species")
 + theme_bw())

display(plot1)

Scatterplot with seaborn

We can make a similar plot in seaborn. This time, let’s include sex by setting the point style

# Create the figure and axes obects
fig, ax = plt.subplots(figsize=(10, 8))

# Create a plot 
sns.scatterplot(
    data=penguins, x="bill_length_mm", y="bill_depth_mm",
    hue="species", ## hue = fill
    style="sex",  ## style = style of dots
    palette="Set2", ## sets color pallet
    edgecolor="black", s=300, ## line color and point size 
    ax=ax              ## Draw plot on ax      
)

# Use methods on ax to set title, labels
ax.set_title("Penguin Bill Length vs Depth by Species")
ax.set_xlabel("Bill Length (mm)")
ax.set_ylabel("Bill Depth (mm)")
ax.legend(title="Species")

# Plot the figure
fig.tight_layout() 
#fig.show() -> if not in interactive

Scatterplot with seaborn

Scaling the data - Understanding the Standard Scaler class

For our clustering to work well, the predictors should be on the same scale. To achieve this, we use an instance of the StandardScaler class.

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

Parameters are supplied by user
- copy, with_mean, with_std

Attributes contain the data of the object
- scale_: scaling factor
- mean_: mean value for each feature
- var_: variance for each feature
- n_features_in_: number of features seen during fit
- n_samples_seen: number of samples processed for each feature

Methods describe the behaviors of the object and/or modify its attributes
- fit(X): computes mean and std used for scaling and ‘fits’ scaler to data X
- transform(X): performs standardization by centering and scaling X with fitted scaler
- fit_transform(X): does both

Scaling Data

# Selecting features for clustering -> let's just use bill length and bill depth.
X = penguins[["bill_length_mm", "bill_depth_mm"]]
y = penguins["species"]

# Standardizing the features for better clustering performance
scaler = StandardScaler() ## create instance of StandardScaler
X_scaled = scaler.fit_transform(X)

Feature	Original		Scaled
Original vs Scaled Features
Feature	Bill Length	Bill Depth	Bill Length	Bill Depth
mean	44	17	0	0
std	5	2	1	1

Show table code

## Make X_scaled a pandas df
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Compute summary statistics and round to 2 sig figs
original_stats = X.agg(["mean", "std"])
scaled_stats = X_scaled_df.agg(["mean", "std"])

# Combine into a single table with renamed columns
summary_table = pd.concat([original_stats, scaled_stats], axis=1)
summary_table.columns = ["Bill_Length_o", "Bill_Depth_o", "Bill_Length_s", "Bill_Depth_s"]
summary_table.index.name = "Feature"

# Display nicely with great_tables
(
    GT(summary_table.reset_index()).tab_header("Original vs Scaled Features")
    .fmt_number(columns =  ["Bill_Length_o", "Bill_Depth_o", "Bill_Length_s", "Bill_Depth_s"], decimals=0)
    .tab_spanner(label="Original", columns=["Bill_Length_o", "Bill_Depth_o"])
    .tab_spanner(label="Scaled", columns=["Bill_Length_s", "Bill_Depth_s"])
    .cols_label(Bill_Length_o = "Bill Length", Bill_Depth_o = "Bill Depth", Bill_Length_s = "Bill Length", Bill_Depth_s = "Bill Depth")
    .tab_options(table_font_size = 16)
)

Understanding the KMeans model class

class sklearn.cluster.KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300, 
tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

Parameters: Set by user at time of instantiation
- n_clusters, max_iter, algorithm

Attributes: Store object data
- cluster_centers_: stores coordinates of cluster centers
- labels_: stores labels of each point - n_iter_: number of iterations run (will be changed during method run)
- n_features_in and feature_names_in_: store info about features seen during fit

Methods: Define object behaviors
- fit(X): fits model to data X - predict(X): predicts closest cluster each sample in X belongs to
- transform(X): transforms X to cluster-distance space

Create model

## Choosing 3 clusters b/c we have 3 species
kmeans = KMeans(n_clusters=3, random_state=42) ## make an instance of the K means class
kmeans

KMeans(n_clusters=3, random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Fit model to data

## the fit
penguins["kmeans_cluster"] = kmeans.fit_predict(X_scaled)

## now that we fit the model, we should have cluster centers
print("Coordinates of cluster centers:", kmeans.cluster_centers_)

## shows that model is fitted
kmeans

Coordinates of cluster centers: [[-0.95023997  0.55393493]
 [ 0.58644397 -1.09805504]
 [ 1.0886843   0.79503579]]

KMeans(n_clusters=3, random_state=42)

Now we create our model, setting the number of clusters to 3 because we have 3 species. One of the fun things about sklearn is that, the repr method gives us this cute little block representation of our model. It shows the parameters that we set during instantiation as well as the model status (the i icon) and link to the model documentation (the ? icon). For non-fitted models, the box is orange and, once we fit the model, the box will be blue!

We can use the fit-predict method to fit the model to X_scaled and return the cluster labels, which we can store in the penguins dataframe as ‘kmeans_cluster’.

Now that the model has been fit, we can check out the cluster centers (remember that this is in scaled X space) and also look at our model representation again to see that it’s turned blue.

Use function to calculate ARI

To check how good our model is, we can use one of the functions included in the sklearn library.

The adjusted_rand_score() function evaluates how well the cluster groupings agree with the species groupings while adjusting for chance.

# Calculate clustering performance using Adjusted Rand Index (ARI)
kmeans_ari = adjusted_rand_score(penguins['species'], penguins["kmeans_cluster"])
print(f"k-Means Adjusted Rand Index: {kmeans_ari:.2f}")

k-Means Adjusted Rand Index: 0.82

We can also use methods on our data structure to create new data

We can use the .groupby() method to help us plot cluster agreement with species label as a heatmap
If we want to add sex as a variable to see if that is why our clusters don’t agree with our species, we can use a scatterplot
Using seaborn and matplotlib, we can easily put both of these plots on the same figure.

# Count occurrences of each species-cluster-sex combination
# (.size gives the count as index, use reset_index to get count column.)
scatter_data = (penguins.groupby(["species", "kmeans_cluster", "sex"])
                .size()
                .reset_index(name="count"))
species_order = list(scatter_data['species'].unique()) ## defining this for later

# Create a mapping to add horizontal jitter for each sex for scatterplot
sex_jitter = {'Male': -0.1, 'Female': 0.1}
scatter_data['x_jittered'] = scatter_data.apply(
    lambda row: scatter_data['species'].unique().tolist().index(row['species']) +
     sex_jitter.get(row['sex'], 0),
    axis=1
)

heatmap_data = scatter_data.pivot_table(index="kmeans_cluster", columns="species", 
values="count", aggfunc="sum", fill_value=0)

Scatter data & Heatmap Data

display(scatter_data.head(3))

	species	kmeans_cluster	sex	count	x_jittered
0	Adelie	0	Female	73	0.1
1	Adelie	0	Male	69	-0.1
2	Adelie	2	Male	4	-0.1

display(heatmap_data)

species	Adelie	Chinstrap	Gentoo
kmeans_cluster
0	142	5	1
1	0	9	112
2	4	54	6

Creating Plots

# Prepare the figure with 2 subplots; the axes object will contain both plots
fig2, axes = plt.subplots(1, 2, figsize=(16, 7)) ## 1 row 2 columns

# Plot heatmap on the first axis
sns.heatmap(data = heatmap_data, cmap="Blues", linewidths=0.5, linecolor='white', annot=True, 
fmt='d', ax=axes[0]) ## fmt='d' = decimal (base10) integer, use fmt='f' for floats 
axes[0].set_title("Heatmap of KMeans Clustering by Species")
axes[0].set_xlabel("Species")
axes[0].set_ylabel("KMeans Cluster")

# Scatterplot with jitter
sns.scatterplot(data=scatter_data, x="x_jittered", y="kmeans_cluster",
    hue="species", style="sex", size="count", sizes=(100, 500),
    alpha=0.8, ax=axes[1], legend="brief")
axes[1].set_xticks(range(len(species_order)))
axes[1].set_xticklabels(species_order)
axes[1].set_title("Cluster Assignment by Species and Sex (Jittered)")
axes[1].set_ylabel("KMeans Cluster")
axes[1].set_xlabel("Species")
axes[1].set_yticks([0, 1, 2])
axes[1].legend(bbox_to_anchor=(1.05, 0.5), loc='center left', borderaxespad=0.0, title="Legend")

fig2.tight_layout()
#fig2.show()

Now we can create our plots on this new figure, fig2. Because we want to create more than 1 plot, we have additional arguments in plt.subplots. 1, 2 means 1 row 2 columns.

This time, our ‘axis’ is actually a 1d array of axis objects. If we had more than 1 row, it would be a 2d array (like a matrix). -adv- If we want to plot the heatmap on the first ‘ax’, we can set ax=axes[0]. Here, we don’t have to explicitly set X and Y, we can just pass the data into the heatmap function.

-adv- For our scatterplot, we need to set x and y explicitly, using the x_jittered column so our points don’t overlap. To make things clean here, we can set the xticks using the species_order we defined previously and put our legend to the left of the plot.

Note: bbox_to_anchor -> x coord, y coord in terms of plot size. Aka a little more than 1 plot to the left and half a plot up.

Creating Plots

Project 2: KNN classification

For our KNN classification, the model is supervised (meaning it is dependent on the outcome ‘y’ data). This time, we need to split our data into a training and test set.

The function train_test_split() from scikit-learn is helpful here!

# Splitting dataset into training and testing sets (still using scaled X!)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

Unlike R functions, which return a single object (often a list when multiple outputs are needed), Python functions can return multiple values as a tuple—letting you unpack them directly into separate variables.

Understanding KNeighborsClassifier class

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', 
algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

Parameters: Set by user at time of instantiation
- n_neigbors, weights, algorithm, etc.

Attributes: Store object data
- classes_: class labels known to the classifier
- effective_metric_: distance metric used
- effective_metric_params_: parameters for the metric function
- n_features_in and feature_names_in_: store info about features seen during fit
- n_samples_fit_: number of samples in fitted data

Methods: Define object behaviors
- .fit(X, y): fit knn classifier from training dataset (X and y)
- .predict(X): predict class labels for provided data X
- .predict_proba(X): return probability estimates for test data X
- .score(X, y): return mean accuracy on given test data X and labels y

Making an instance of KNeighborsClassifier and fitting to training data

For a supervised model, y_train is included in .fit()!

## perform knn classification
# Applying k-NN classification with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5) ## make an instance of the KNeighborsClassifier class
# and set the n_neighbors parameter to be 5. 

# Use the fit method to fit the model to the training data
knn.fit(X_train, y_train)
knn

KNeighborsClassifier()

Once the model is fit…

-We can look at its attributes (ex: .classes_) which gives the class labels as known to the classifier

print(knn.classes_)

['Adelie' 'Chinstrap' 'Gentoo']

-And use fitted model to predict species for test data

# Use the predict method on the test data to get the predictions for the test data
y_pred = knn.predict(X_test)

# Also can take a look at the prediction probabilities, 
# and use the .classes_ attribute to put the column labels in the right order
probs = pd.DataFrame(
    knn.predict_proba(X_test),
    columns = knn.classes_)
probs['y_pred'] = y_pred

print("Predicted probabilities: \n", probs.head())

Predicted probabilities: 
    Adelie  Chinstrap  Gentoo     y_pred
0     1.0        0.0     0.0     Adelie
1     0.0        0.0     1.0     Gentoo
2     1.0        0.0     0.0     Adelie
3     0.0        0.6     0.4  Chinstrap
4     1.0        0.0     0.0     Adelie

Scatterplot for k-NN classification of test data

Create dataframe of unscaled X_test, bill_length_mm, and bill_depth_mm.
Add to it the actual and predicted species labels

## First unscale the test data
X_test_unscaled = scaler.inverse_transform(X_test)

## create dataframe 
penguins_test = pd.DataFrame(
    X_test_unscaled,
    columns=['bill_length_mm', 'bill_depth_mm']
)

## add actual and predicted species 
penguins_test['y_actual'] = y_test.values
penguins_test['y_pred'] = y_pred
penguins_test['correct'] = penguins_test['y_actual'] == penguins_test['y_pred']

print("Results: \n", penguins_test.head())

Results: 
    bill_length_mm  bill_depth_mm   y_actual     y_pred  correct
0            39.5           16.7     Adelie     Adelie     True
1            46.9           14.6     Gentoo     Gentoo     True
2            42.1           19.1     Adelie     Adelie     True
3            49.8           17.3  Chinstrap  Chinstrap     True
4            41.1           18.2     Adelie     Adelie     True

Plotnine scatterplot for k-NN classification of test data

To see how well our model did at classifying the remaining penguins…

## Build the plot
plot3 = (ggplot(penguins_test, aes(x="bill_length_mm", y="bill_depth_mm", 
color="y_actual", fill = 'y_pred', shape = 'correct'))
 + geom_point(size=4, stroke=1.1)  # Stroke controls outline thickness
 + scale_shape_manual(values={True: 'o', False: '^'})  # Circle and triangle
 + ggtitle("k-NN Classification Results")
 + theme_bw())

display(plot3)

Plotnine scatterplot for k-NN classification of test data

Visualizing Decision Boundary with seaborn and matplotlib

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.preprocessing import LabelEncoder

# Create and fit label encoder for y (just makes y numeric because it makes the scatter plot happy)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Create the plot objects
fig, ax = plt.subplots(figsize=(12, 8))

# Create display object
disp = DecisionBoundaryDisplay.from_estimator(
    knn,
    X_test,
    response_method = 'predict',
    plot_method = 'pcolormesh',
    xlabel = "bill_length_scaled",
    ylabel = "bill_depth_scaled",
    shading = 'auto',
    alpha = 0.5,
    ax = ax
)

# Use method from display object to create scatter plot
scatter = disp.ax_.scatter(X_scaled[:,0], X_scaled[:,1], c=y_encoded, edgecolors = 'k')
disp.ax_.legend(scatter.legend_elements()[0], knn.classes_, loc = 'lower left', title = 'Species')
_ = disp.ax_.set_title("Penguin Classification")

fig.show()

Visualizing Decision Boundary with seaborn and matplotlib

Evaluate KNN performance

To check the performance of our KNN classifier, we can check the accuracy score and print a classification report.
- accuracy_score and classification_report are both functions!
- They are not unique to scikit-learn classes so it makes sense for them to be functions not methods

## eval knn performance
knn_accuracy = accuracy_score(y_test, y_pred)
print(f"k-NN Accuracy: {knn_accuracy:.2f}")
print("Classification Report: \n", classification_report(y_test, y_pred))

k-NN Accuracy: 0.94
Classification Report: 
               precision    recall  f1-score   support

      Adelie       0.98      0.98      0.98        48
   Chinstrap       0.80      0.89      0.84        18
      Gentoo       0.97      0.91      0.94        34

    accuracy                           0.94       100
   macro avg       0.92      0.93      0.92       100
weighted avg       0.94      0.94      0.94       100

Make a Summary Table of Metrics for Both Models

summary_table = pd.DataFrame({
    "Metric": ["k-Means Adjusted Rand Index", "k-NN Accuracy"],
    "Value": [kmeans_ari, knn_accuracy]
})
(
    GT(summary_table)
    .tab_header(title = "Model Results Summary")
    .fmt_number(columns = "Value", n_sigfig = 2)
    .tab_options(table_font_size = 20)
)

Metric	Value
Model Results Summary
k-Means Adjusted Rand Index	0.82
k-NN Accuracy	0.94

Key Takeaways from This Session

Python workflows rely on object-oriented structures in addition to functions:
Understanding the OOP paradigm makes Python a lot easier!
Everything is an object!
Duck Typing:
If an object has a method, that method can be called regardless of the object type. Caveat being, make sure the arguments (if any) in the method are specified correctly for all objects!
Python packages use common methods that make it easy to change between model types without changing a lot of code.

Additional Insights

Predictable APIs enable seamless model switching:
Swapping models like LogisticRegression → RandomForestClassifier usually requires minimal code changes.
scikit-learn prioritizes interoperability:
Its consistent class design integrates with tools like Pipeline, GridSearchCV, and cross_val_score.
Class attributes improve model transparency:
Access attributes like .coef_, .classes_, and .feature_importances_ for model interpretation and debugging.
Custom classes are central to deep learning:
Frameworks like PyTorch and TensorFlow require you to define your own model classes by subclassing base models.
Mixins support modular design:
Mixins (e.g., ClassifierMixin) let you add specific functionality without duplicating code.

‘Homework’

The homework for this session can be found at “H:Exercises_homework_blank.ipynb”
Please copy the file! Do not modify the file on the H drive!

Also on the H drive, under the ‘Solutions’ subfolder, are an html file and a jupyter notebook with sample solutions for the homework.

	n_clusters	3
	init	'k-means++'
	n_init	'auto'
	max_iter	300
	tol	0.0001
	verbose	0
	random_state	42
	copy_x	True
	algorithm	'lloyd'

	n_neighbors	5
	weights	'uniform'
	algorithm	'auto'
	leaf_size	30
	p	2
	metric	'minkowski'
	metric_params	None
	n_jobs	None

Session 4 – Object-Oriented Programming and Modeling Libraries

Session Overview

Introduction

Why Python? 🐍

R: Built by Statisticians for Statisticians

Python: General-Purpose Language

Programming Styles: R vs Python

Modeling in Python

Why Does OOP Matter in Python Modeling?

Part 1: Object-Oriented Programming

Key OOP Principles (Recap)

Classes and Objects

Example: Creating a Class

Base Classes

Creating a dog

Derived (Child) Classes

Mixins

Duck Typing

Duck Typing in Action

Example: OOP in Machine Learning and Modeling

Key Benefits of OOP in Machine Learning

Part B - Demo Projects

🐧 Mini Project: Classifying Penguins with scikit-learn

Import Libraries

Import Libraries

Data Preparation

Data Visualization

Plotting with Plotnine vs Seaborn

Scatterplot with plotnine

Scatterplot with seaborn

Scatterplot with seaborn

Scaling the data - Understanding the Standard Scaler class

Scaling Data

Understanding the KMeans model class

Create model

Fit model to data

Use function to calculate ARI

We can also use methods on our data structure to create new data

Scatter data & Heatmap Data

Creating Plots

Creating Plots

Project 2: KNN classification

Understanding KNeighborsClassifier class

Making an instance of KNeighborsClassifier and fitting to training data

Once the model is fit…

Scatterplot for k-NN classification of test data

Plotnine scatterplot for k-NN classification of test data

Plotnine scatterplot for k-NN classification of test data

Visualizing Decision Boundary with seaborn and matplotlib

Visualizing Decision Boundary with seaborn and matplotlib

Evaluate KNN performance

Make a Summary Table of Metrics for Both Models

Key Takeaways from This Session

Additional Insights

‘Homework’

Pre-Reading for This Session