Session 4 – Pre-read
KNN vs K-Means: Supervised vs Unsupervised Learning
We will be using the scikit-learn implementations of both of these algorithms during the session 4 tutorial.
K-Nearest Neighbors (KNN)
KNN is a supervised learning algorithm used for classification (and sometimes regression):
- You train the model on labeled data (i.e., you know the “answer” or class).
- When predicting a new sample, the model finds the k training samples closest to it (its “neighbors”) and uses them to assign a label.
- Closeness is usually based on Euclidean distance.
Example: Given a penguin with known bill length and depth, predict its species by looking at its 5 nearest neighbors in the training data.
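The penguin example above can be sketched with scikit-learn's `KNeighborsClassifier`. The measurements and species below are illustrative values, not the tutorial's actual dataset:

```python
from sklearn.neighbors import KNeighborsClassifier

# Labeled training data: [bill_length_mm, bill_depth_mm] (illustrative values)
X_train = [
    [39.1, 18.7], [38.9, 17.8], [40.3, 18.0],   # Adelie
    [46.5, 17.9], [45.2, 16.4],                  # Chinstrap
    [50.0, 15.3], [49.5, 16.1], [47.5, 15.0],   # Gentoo
]
y_train = ["Adelie", "Adelie", "Adelie",
           "Chinstrap", "Chinstrap",
           "Gentoo", "Gentoo", "Gentoo"]

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbors
knn.fit(X_train, y_train)  # "lazy" learner: fit() essentially just stores the data

# Classify a new penguin by majority vote among its 5 nearest neighbors
prediction = knn.predict([[40.0, 18.0]])
print(prediction)
```

Note that `fit()` does almost no work here; the distance computations happen at prediction time, which is why KNN is called a lazy learner.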
K-Means Clustering
K-Means is an unsupervised learning algorithm used for clustering:
- You do not provide the true labels.
- The algorithm tries to split your data into k groups based on similarity.
- It randomly initializes cluster centers, assigns points to the nearest one, then updates the centers iteratively.
Example: Given penguin data without species labels, group them into 3 clusters based on bill length and depth.
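The clustering example above can be sketched with scikit-learn's `KMeans`. Again, the measurements are illustrative, and note that the species labels are never given to the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled measurements: [bill_length_mm, bill_depth_mm] (illustrative values)
X = np.array([
    [39.1, 18.7], [38.9, 17.8], [40.3, 18.0],
    [46.5, 17.9], [45.2, 16.4], [46.0, 17.0],
    [50.0, 15.3], [49.5, 16.1], [50.7, 15.0],
])

# k = 3 clusters; random_state fixes the random initialization for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # cluster index (0, 1, or 2) for each row

print(labels)                   # cluster assignments, not species names
print(kmeans.cluster_centers_)  # final cluster centers after iterative updates
```

The cluster indices are arbitrary (cluster 0 has no inherent meaning); comparing them to the true species afterwards is one way to judge how well the clusters recovered the real groups.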
Key Differences
| Feature | KNN | K-Means |
|---|---|---|
| Learning Type | Supervised | Unsupervised |
| Goal | Classification (or Regression) | Clustering |
| Input Labels | Required | Not used |
| Output | Predicted class | Cluster assignment |
| Model Type | Lazy (no training phase) | Iterative center updates |
Plotting in Python
Please read the ‘Parts of a Figure’ and ‘Coding Styles’ sections of the Quick Start Guide (Matplotlib). We will briefly cover plotting with Seaborn (which is built on top of the Matplotlib package), but will not spend much time talking about base Matplotlib.
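As a small preview of how Seaborn sits on top of Matplotlib, the sketch below creates a Matplotlib Figure and Axes (the "parts of a figure" from the Quick Start Guide) and lets Seaborn draw onto the Axes. The column names and data are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs in a plain script
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data; column names are assumptions, not the tutorial's dataset
df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.9, 45.2, 49.5],
    "bill_depth_mm": [18.7, 17.9, 15.3, 17.8, 16.4, 16.1],
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie", "Chinstrap", "Gentoo"],
})

# Matplotlib supplies the Figure and Axes objects...
fig, ax = plt.subplots()

# ...and Seaborn draws a labeled scatter plot onto that Axes
sns.scatterplot(data=df, x="bill_length_mm", y="bill_depth_mm",
                hue="species", ax=ax)
ax.set_title("Penguin bill measurements")
```

Because Seaborn functions accept an `ax=` argument, you can always drop down to base Matplotlib (titles, axis limits, saving the figure) on the same Axes object.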
Optional Reading
Introduction to Object-Oriented Programming (OOP) in Python
We will cover the basics of object-oriented programming and how it relates to analysis workflows in Python during session 4, but Introduction to OOP in Python (Real Python) explains in greater detail and includes some practice exercises.
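As a minimal taste of the OOP ideas that session 4 will touch on: a class bundles data (attributes) with behavior (methods), a pattern that analysis libraries like scikit-learn follow (e.g., an estimator object holding its fitted state). The class below is a toy example, not part of any library:

```python
# A tiny illustrative class: data + behavior in one object
class Penguin:
    def __init__(self, species, bill_length_mm, bill_depth_mm):
        # Attributes: data stored on each instance
        self.species = species
        self.bill_length_mm = bill_length_mm
        self.bill_depth_mm = bill_depth_mm

    def bill_ratio(self):
        # Method: behavior that uses the instance's own data
        return self.bill_length_mm / self.bill_depth_mm

p = Penguin("Adelie", 39.1, 18.7)
print(p.species, round(p.bill_ratio(), 2))
```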