Unsupervised Learning
> No labels, no teacher. The algorithm finds structure on its own.
Type: Build
Languages: Python
Prerequisites: Phase 1 (Norms & Distances, Probability & Distributions), Phase 2 Lessons 1-6
Time: ~90 minutes
Learning Objectives
- Implement K-Means, DBSCAN, and Gaussian Mixture Models from scratch and compare their clustering behavior
- Evaluate cluster quality using the silhouette score and the elbow method to select the optimal K
- Explain when DBSCAN outperforms K-Means and identify which algorithm handles non-spherical clusters and outliers
- Build an anomaly detection pipeline using clustering methods to flag points that deviate from normal patterns
The Problem
Every ML lesson so far has assumed labeled data: "here is an input, here is the correct output." In the real world, labels are expensive. A hospital has millions of patient records but no one has manually tagged each one with a disease category. An e-commerce site has millions of user sessions but no one has hand-labeled customer segments. A security team has network logs but nobody has flagged every anomaly.
Unsupervised learning finds patterns without being told what to look for. It groups similar data points, discovers hidden structures, and surfaces anomalies. If supervised learning is learning from a textbook with an answer key, unsupervised learning is staring at raw data until the patterns reveal themselves.
The catch: without labels, you cannot directly measure "right" or "wrong." You need different tools to evaluate whether the structure your algorithm found is meaningful.
The Concept
Clustering: Grouping Similar Things Together
Clustering assigns each data point to a group (cluster) so that points within the same group are more similar to each other than to points in other groups. The question is always: what does "similar" mean?
K-Means: The Workhorse
K-Means partitions data into exactly K clusters. Each cluster has a centroid (its center of mass), and every point belongs to the nearest centroid.
Lloyd's algorithm:
- Pick K random points as initial centroids
- Assign each data point to the nearest centroid
- Recompute each centroid as the mean of its assigned points
- Repeat steps 2-3 until assignments stop changing
The objective function (inertia) measures the total squared distance from each point to its assigned centroid. K-Means minimizes this, but only finds a local minimum. Different initializations can give different results.
Choosing K
Two standard methods:
Elbow method: Run K-Means for K = 1, 2, 3, ..., n. Plot inertia vs K. Look for the "elbow" where adding more clusters stops reducing inertia significantly.
Silhouette score: For each point, measure how similar it is to its own cluster (a) versus the nearest other cluster (b). The silhouette coefficient is (b - a) / max(a, b), ranging from -1 (wrong cluster) to +1 (well-clustered). Average across all points for a global score.
DBSCAN: Density-Based Clustering
K-Means assumes clusters are spherical and requires you to pick K upfront. DBSCAN makes neither assumption. It finds clusters as dense regions separated by sparse regions.
Two parameters:
- eps: the radius of a neighborhood
- min_samples: the minimum number of points needed to form a dense region
Three types of points:
- Core point: has at least min_samples points within eps distance
- Border point: within eps of a core point but not itself a core point
- Noise point: neither core nor border. These are outliers.
DBSCAN connects core points that are within eps of each other into the same cluster. Border points join the cluster of a nearby core point. Noise points belong to no cluster.
Strengths: finds clusters of any shape, automatically determines the number of clusters, identifies outliers. Weakness: struggles with clusters of varying densities.
Hierarchical Clustering
Builds a tree (dendrogram) of nested clusters.
Agglomerative (bottom-up):
- Start with each point as its own cluster
- Merge the two closest clusters
- Repeat until only one cluster remains
- Cut the dendrogram at the desired level to get K clusters
The "closeness" between clusters can be measured as:
- Single linkage: minimum distance between any two points in the two clusters
- Complete linkage: maximum distance between any two points
- Average linkage: average distance between all pairs
- Ward's method: the merge that causes the smallest increase in total within-cluster variance
Gaussian Mixture Models (GMM)
K-Means gives hard assignments: each point belongs to exactly one cluster. GMM gives soft assignments: each point has a probability of belonging to each cluster.
GMM assumes the data is generated from a mixture of K Gaussian distributions, each with its own mean and covariance. The Expectation-Maximization (EM) algorithm alternates between:
- E-step: compute the probability that each point belongs to each Gaussian
- M-step: update the mean, covariance, and mixing weight of each Gaussian to maximize the likelihood of the data
GMM can model elliptical clusters (not just spherical like K-Means) and naturally handles overlapping clusters.
When to Use Which
| Method | Best for | Avoid when |
|---|---|---|
| K-Means | Large datasets, spherical clusters, known K | Irregular shapes, outliers present |
| DBSCAN | Unknown K, arbitrary shapes, outlier detection | Varying densities, very high dimensions |
| Hierarchical | Small datasets, need dendrogram, unknown K | Large datasets (O(n^2) memory) |
| GMM | Overlapping clusters, soft assignments needed | Very large datasets, too many dimensions |
Anomaly Detection with Clustering
Clustering naturally supports anomaly detection:
- K-Means: points far from any centroid are anomalies
- DBSCAN: noise points are anomalies by definition
- GMM: points with low probability under all Gaussians are anomalies
Build It
Step 1: K-Means from scratch
import math
import random
def euclidean_distance(a, b):
return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
def kmeans(data, k, max_iterations=100, seed=42):
random.seed(seed)
n_features = len(data[0])
centroids = random.sample(data, k)
for iteration in range(max_iterations):
clusters = [[] for _ in range(k)]
assignments = []
for point in data:
distances = [euclidean_distance(point, c) for c in centroids]
nearest = distances.index(min(distances))
clusters[nearest].append(point)
assignments.append(nearest)
new_centroids = []
for cluster in clusters:
if len(cluster) == 0:
new_centroids.append(random.choice(data))
continue
centroid = [
sum(point[j] for point in cluster) / len(cluster)
for j in range(n_features)
]
new_centroids.append(centroid)
if all(
euclidean_distance(old, new) < 1e-6
for old, new in zip(centroids, new_centroids)
):
print(f" Converged at iteration {iteration + 1}")
break
centroids = new_centroids
return assignments, centroids
Step 2: Elbow method and silhouette score
def compute_inertia(data, assignments, centroids):
total = 0.0
for point, cluster_id in zip(data, assignments):
total += euclidean_distance(point, centroids[cluster_id]) ** 2
return total
def silhouette_score(data, assignments):
n = len(data)
if n < 2:
return 0.0
clusters = {}
for i, c in enumerate(assignments):
clusters.setdefault(c, []).append(i)
if len(clusters) < 2:
return 0.0
scores = []
for i in range(n):
own_cluster = assignments[i]
own_members = [j for j in clusters[own_cluster] if j != i]
if len(own_members) == 0:
scores.append(0.0)
continue
a = sum(euclidean_distance(data[i], data[j]) for j in own_members) / len(own_members)
b = float("inf")
for cluster_id, members in clusters.items():
if cluster_id == own_cluster:
continue
avg_dist = sum(euclidean_distance(data[i], data[j]) for j in members) / len(members)
b = min(b, avg_dist)
if max(a, b) == 0:
scores.append(0.0)
else:
scores.append((b - a) / max(a, b))
return sum(scores) / len(scores)
def find_best_k(data, max_k=10):
print("Elbow method:")
inertias = []
for k in range(1, max_k + 1):
assignments, centroids = kmeans(data, k)
inertia = compute_inertia(data, assignments, centroids)
inertias.append(inertia)
print(f" K={k}: inertia={inertia:.2f}")
print("\nSilhouette scores:")
for k in range(2, max_k + 1):
assignments, centroids = kmeans(data, k)
score = silhouette_score(data, assignments)
print(f" K={k}: silhouette={score:.4f}")
return inertias
Step 3: DBSCAN from scratch
def dbscan(data, eps, min_samples):
n = len(data)
labels = [-1] * n
cluster_id = 0
def region_query(point_idx):
neighbors = []
for i in range(n):
if euclidean_distance(data[point_idx], data[i]) <= eps:
neighbors.append(i)
return neighbors
visited = [False] * n
for i in range(n):
if visited[i]:
continue
visited[i] = True
neighbors = region_query(i)
if len(neighbors) < min_samples:
labels[i] = -1
continue
labels[i] = cluster_id
seed_set = list(neighbors)
seed_set.remove(i)
j = 0
while j < len(seed_set):
q = seed_set[j]
if not visited[q]:
visited[q] = True
q_neighbors = region_query(q)
if len(q_neighbors) >= min_samples:
for nb in q_neighbors:
if nb not in seed_set:
seed_set.append(nb)
if labels[q] == -1:
labels[q] = cluster_id
j += 1
cluster_id += 1
return labels
Step 4: Gaussian Mixture Model (EM algorithm)
def gmm(data, k, max_iterations=100, seed=42):
random.seed(seed)
n = len(data)
d = len(data[0])
indices = random.sample(range(n), k)
means = [list(data[i]) for i in indices]
variances = [1.0] * k
weights = [1.0 / k] * k
def gaussian_pdf(x, mean, variance):
d = len(x)
coeff = 1.0 / ((2 * math.pi * variance) ** (d / 2))
exponent = -sum((xi - mi) ** 2 for xi, mi in zip(x, mean)) / (2 * variance)
return coeff * math.exp(max(exponent, -500))
for iteration in range(max_iterations):
responsibilities = []
for i in range(n):
probs = []
for j in range(k):
probs.append(weights[j] * gaussian_pdf(data[i], means[j], variances[j]))
total = sum(probs)
if total == 0:
total = 1e-300
responsibilities.append([p / total for p in probs])
old_means = [list(m) for m in means]
for j in range(k):
r_sum = sum(responsibilities[i][j] for i in range(n))
if r_sum < 1e-10:
continue
weights[j] = r_sum / n
for dim in range(d):
means[j][dim] = sum(
responsibilities[i][j] * data[i][dim] for i in range(n)
) / r_sum
variances[j] = sum(
responsibilities[i][j]
* sum((data[i][dim] - means[j][dim]) ** 2 for dim in range(d))
for i in range(n)
) / (r_sum * d)
variances[j] = max(variances[j], 1e-6)
shift = sum(
euclidean_distance(old_means[j], means[j]) for j in range(k)
)
if shift < 1e-6:
print(f" GMM converged at iteration {iteration + 1}")
break
assignments = []
for i in range(n):
assignments.append(responsibilities[i].index(max(responsibilities[i])))
return assignments, means, weights, responsibilities
Step 5: Generate test data and run everything
def make_blobs(centers, n_per_cluster=50, spread=0.5, seed=42):
random.seed(seed)
data = []
true_labels = []
for label, (cx, cy) in enumerate(centers):
for _ in range(n_per_cluster):
x = cx + random.gauss(0, spread)
y = cy + random.gauss(0, spread)
data.append([x, y])
true_labels.append(label)
return data, true_labels
def make_moons(n_samples=200, noise=0.1, seed=42):
random.seed(seed)
data = []
labels = []
n_half = n_samples // 2
for i in range(n_half):
angle = math.pi * i / n_half
x = math.cos(angle) + random.gauss(0, noise)
y = math.sin(angle) + random.gauss(0, noise)
data.append([x, y])
labels.append(0)
for i in range(n_half):
angle = math.pi * i / n_half
x = 1 - math.cos(angle) + random.gauss(0, noise)
y = 1 - math.sin(angle) - 0.5 + random.gauss(0, noise)
data.append([x, y])
labels.append(1)
return data, labels
if __name__ == "__main__":
centers = [[2, 2], [8, 3], [5, 8]]
data, true_labels = make_blobs(centers, n_per_cluster=50, spread=0.8)
print("=== K-Means on 3 blobs ===")
assignments, centroids = kmeans(data, k=3)
print(f" Centroids: {[[round(c, 2) for c in cent] for cent in centroids]}")
sil = silhouette_score(data, assignments)
print(f" Silhouette score: {sil:.4f}")
print("\n=== Elbow Method ===")
find_best_k(data, max_k=6)
print("\n=== DBSCAN on 3 blobs ===")
db_labels = dbscan(data, eps=1.5, min_samples=5)
n_clusters = len(set(db_labels) - {-1})
n_noise = db_labels.count(-1)
print(f" Found {n_clusters} clusters, {n_noise} noise points")
print("\n=== GMM on 3 blobs ===")
gmm_assignments, gmm_means, gmm_weights, _ = gmm(data, k=3)
print(f" Means: {[[round(m, 2) for m in mean] for mean in gmm_means]}")
print(f" Weights: {[round(w, 3) for w in gmm_weights]}")
gmm_sil = silhouette_score(data, gmm_assignments)
print(f" Silhouette score: {gmm_sil:.4f}")
print("\n=== DBSCAN on moons (non-spherical clusters) ===")
moon_data, moon_labels = make_moons(n_samples=200, noise=0.1)
moon_db = dbscan(moon_data, eps=0.3, min_samples=5)
n_moon_clusters = len(set(moon_db) - {-1})
n_moon_noise = moon_db.count(-1)
print(f" Found {n_moon_clusters} clusters, {n_moon_noise} noise points")
print("\n=== K-Means on moons (will fail to separate) ===")
moon_km, moon_centroids = kmeans(moon_data, k=2)
moon_sil = silhouette_score(moon_data, moon_km)
print(f" Silhouette score: {moon_sil:.4f}")
print(" K-Means splits moons poorly because they are not spherical")
print("\n=== Anomaly detection with DBSCAN ===")
anomaly_data = list(data)
anomaly_data.append([20.0, 20.0])
anomaly_data.append([-5.0, -5.0])
anomaly_data.append([15.0, 0.0])
anomaly_labels = dbscan(anomaly_data, eps=1.5, min_samples=5)
anomalies = [
anomaly_data[i]
for i in range(len(anomaly_labels))
if anomaly_labels[i] == -1
]
print(f" Detected {len(anomalies)} anomalies")
for a in anomalies[-3:]:
print(f" Point {[round(v, 2) for v in a]}")
Use It
With scikit-learn, the same algorithms are one-liners:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score as sklearn_silhouette
km = KMeans(n_clusters=3, random_state=42).fit(data)
db = DBSCAN(eps=1.5, min_samples=5).fit(data)
agg = AgglomerativeClustering(n_clusters=3).fit(data)
gmm_model = GaussianMixture(n_components=3, random_state=42).fit(data)
The from-scratch versions show you exactly what these libraries compute. K-Means iterates between assigning and recomputing. DBSCAN grows clusters from dense seeds. GMM alternates between expectation and maximization. The library versions add numerical stability, smarter initialization (K-Means++), and GPU acceleration, but the core logic is the same.
Ship It
This lesson produces working implementations of K-Means, DBSCAN, and GMM from scratch. The clustering code can be reused as a foundation for more advanced unsupervised methods.
Exercises
- Implement K-Means++ initialization: instead of picking random centroids, pick the first randomly and each subsequent centroid with probability proportional to its squared distance from the nearest existing centroid. Compare convergence speed to random initialization.
- Add hierarchical agglomerative clustering to the code. Implement Ward's linkage and produce a dendrogram (as a nested list of merges). Cut it at different levels and compare to K-Means results.
- Build a simple anomaly detection pipeline: run DBSCAN and GMM on the same data, flag points that both methods agree are outliers (noise in DBSCAN, low probability in GMM). Measure the overlap and discuss when the methods disagree.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Clustering | "Grouping similar things" | Partitioning data into subsets where within-group similarity exceeds between-group similarity, measured by a specific distance metric |
| Centroid | "The center of a cluster" | The mean of all points assigned to a cluster; used by K-Means as the cluster representative |
| Inertia | "How tight the clusters are" | Sum of squared distances from each point to its assigned centroid; lower is tighter |
| Silhouette score | "How well-separated clusters are" | For each point, (b - a) / max(a, b) where a is mean intra-cluster distance and b is mean nearest-cluster distance |
| Core point | "A point in a dense region" | A point with at least min_samples neighbors within eps distance, in DBSCAN |
| EM algorithm | "Soft K-Means" | Expectation-Maximization: iteratively compute membership probabilities (E-step) and update distribution parameters (M-step) |
| Dendrogram | "A tree of clusters" | A tree diagram showing the order and distance at which clusters were merged in hierarchical clustering |
| Anomaly | "An outlier" | A data point that does not conform to the expected pattern, identified as noise by DBSCAN or low-probability by GMM |
Further Reading
- Stanford CS229 - Unsupervised Learning - Andrew Ng's lecture notes on clustering and EM
- scikit-learn Clustering Guide - practical comparison of all clustering algorithms with visual examples
- DBSCAN original paper (Ester et al., 1996) - the paper that introduced density-based clustering