What Is Machine Learning

> Machine learning is teaching computers to find patterns in data instead of writing rules by hand.

Type: Learn

Languages: Python

Prerequisites: Phase 1 (Math Foundations)

Time: ~45 minutes

Learning Objectives

The Problem

You want to build a spam filter. The traditional approach: sit down and write hundreds of rules. "If the email contains 'FREE MONEY', mark it spam. If it has more than 3 exclamation marks, mark it spam." You spend weeks writing rules. Then spammers change their wording. Your rules break. You write more rules. The cycle never ends.

Machine learning flips this. Instead of writing rules, you give the computer thousands of labeled emails ("spam" or "not spam") and let it figure out the rules on its own. The computer finds patterns you never would have thought of. When spammers change tactics, you retrain on new data instead of rewriting code.

This shift from "programming rules" to "learning from data" is the core of machine learning. Recommendation engines, voice assistants, self-driving cars, and language models are all built on it.

The Concept

Learning From Data, Not Rules

Traditional programming and machine learning solve problems in opposite directions.

flowchart LR
    subgraph Traditional["Traditional Programming"]
        direction LR
        R[Rules] --> P1[Program]
        D1[Data] --> P1
        P1 --> O1[Output]
    end
    subgraph ML["Machine Learning"]
        direction LR
        D2[Data] --> P2[Learning Algorithm]
        O2[Expected Output] --> P2
        P2 --> M[Model / Rules]
    end

Traditional programming: you write the rules. The program applies them to data to produce output.

Machine learning: you provide data and expected outputs. The algorithm discovers the rules.

The "model" that comes out of training IS the rules, encoded as numbers (weights, parameters). It generalizes from examples it has seen to make predictions on data it has never seen.
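As a minimal sketch of the difference, consider a toy spam signal. The exclamation-mark feature, the eight-example dataset, and the names `rule_based`, `learn_threshold`, and `predict` are invented for illustration and are not part of the lesson's code.

```python
# Toy data (hypothetical): exclamation marks per email, with labels.
counts = [0, 1, 1, 2, 5, 6, 7, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = spam, 0 = not spam


def rule_based(count):
    """Traditional programming: a human picks the threshold."""
    return 1 if count > 3 else 0


def learn_threshold(counts, labels):
    """Machine learning in miniature: search for the threshold that
    makes the fewest mistakes on the labeled examples."""
    best_t, best_errors = None, len(labels) + 1
    for t in sorted(set(counts)):
        errors = sum((1 if c > t else 0) != y for c, y in zip(counts, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


# The learned "model" is a single number.
threshold = learn_threshold(counts, labels)


def predict(count):
    """Apply the learned threshold to a new email."""
    return 1 if count > threshold else 0


print("learned threshold:", threshold)
print("rule says:", rule_based(3), "model says:", predict(3))
```

The hand-written rule and the learned one disagree on an email with three exclamation marks, because the learned threshold comes from the labeled examples rather than from a guess.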

The Three Types of Machine Learning

flowchart TD
    ML[Machine Learning] --> SL[Supervised Learning]
    ML --> UL[Unsupervised Learning]
    ML --> RL[Reinforcement Learning]
    SL --> C[Classification]
    SL --> R[Regression]
    UL --> CL[Clustering]
    UL --> DR[Dimensionality Reduction]
    RL --> PO[Policy Optimization]
    RL --> VL[Value Learning]

Supervised Learning: You have input-output pairs. The model learns to map inputs to outputs.

Unsupervised Learning: You have inputs only. No labels. The model finds structure on its own.

Reinforcement Learning: An agent takes actions in an environment and receives rewards or penalties. It learns a strategy (policy) to maximize total reward.

Most of what you will build in practice uses supervised learning. Unsupervised learning is common for preprocessing and exploration. Reinforcement learning powers game AI, robotics, and RLHF for language models.

Beyond the Big Three

The three categories above are clean, but real-world ML often blurs the lines.

Semi-supervised learning uses a small set of labeled data and a large set of unlabeled data. You might have 100 labeled medical images and 100,000 unlabeled ones. Techniques include:

Self-supervised learning creates supervision from the data itself. No human labels needed at all. The model creates its own prediction task from the structure of the data.

These are not separate categories from the big three. They are strategies that combine supervised and unsupervised ideas. Self-supervised learning is technically supervised (the model predicts something), but the labels are generated automatically, not by humans.
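One common way to generate labels automatically is next-token prediction, where each training target is simply the token that follows a context window. The sketch below assumes whitespace tokenization and a hypothetical helper named `next_token_pairs`; real systems use far larger corpora and learned tokenizers.

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) training pairs from raw text alone.

    The "label" for each example is the token that follows the
    context window, so no human annotation is required.
    """
    pairs = []
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        target = tokens[i + context_size]
        pairs.append((context, target))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Once the pairs exist, training proceeds exactly like supervised learning: inputs on one side, targets on the other.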

Classification vs Regression

These are the two main supervised learning tasks.

| Aspect | Classification | Regression |
| --- | --- | --- |
| Output | Discrete categories | Continuous numbers |
| Example | "Is this email spam?" | "What will the house price be?" |
| Output space | {cat, dog, bird} | Any real number |
| Loss function | Cross-entropy (accuracy is an evaluation metric, not a loss) | Mean squared error (MSE), mean absolute error (MAE) |
| What the model learns | Boundaries between classes | A curve that fits the data |

Classification answers "which category?" Regression answers "how much?"

Some problems can be framed either way. Predicting if a stock goes up or down is classification. Predicting the exact price is regression.
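The stock example can be sketched in a few lines: the inputs stay the same and only the target changes. The six prices below are made-up numbers, and using a single feature (today's price) is a simplification for illustration.

```python
import numpy as np

# Hypothetical closing prices for six consecutive days.
prices = np.array([100.0, 101.5, 101.0, 103.2, 102.8, 104.0])

# Input: today's price. The target depends on how the problem is framed.
X = prices[:-1]

# Regression target: tomorrow's exact price (a continuous number).
y_regression = prices[1:]

# Classification target: did the price go up? (a discrete category)
y_classification = (prices[1:] > prices[:-1]).astype(int)

print("regression targets:    ", y_regression)
print("classification targets:", y_classification)  # 1 = up, 0 = down
```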

The ML Workflow

Most machine learning projects follow the same basic pipeline, regardless of the algorithm.

flowchart LR
    A[Collect Data] --> B[Clean & Explore]
    B --> C[Feature Engineering]
    C --> D[Split Data]
    D --> E[Train Model]
    E --> F[Evaluate]
    F -->|Not good enough| C
    F -->|Good enough| G[Deploy]
    G --> H[Monitor]
    H -->|Performance drops| A

Collect Data: Gather raw data. More data is almost always better, but quality matters more than quantity.

Clean & Explore: Handle missing values, remove duplicates, visualize distributions, spot anomalies. This step often takes 60-80% of total project time.

Feature Engineering: Transform raw data into features the model can use. Turn dates into day-of-week. Normalize numerical columns. Encode categorical variables. Good features matter more than fancy algorithms.

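Split Data: Divide into training, validation, and test sets. The model trains on training data, you tune hyperparameters on validation data, and you report final performance on test data. Fit any data-dependent transformations (such as normalization statistics) on the training set only, so that no information from the validation or test data leaks into training.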

Train Model: Feed training data into an algorithm. The algorithm adjusts internal parameters to minimize a loss function.

Evaluate: Measure performance on validation/test data. If performance is not acceptable, go back and try different features, algorithms, or hyperparameters.

Deploy: Put the model into production where it makes predictions on new data.

Monitor: Track performance over time. Data distributions change (data drift), and models degrade. When performance drops, retrain.
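The three transformations named in the Feature Engineering step (dates to day-of-week, normalization, categorical encoding) can be sketched with NumPy and the standard library. The records, column meanings, and variable names here are hypothetical.

```python
from datetime import date

import numpy as np

# Hypothetical raw records: (order date, amount, country).
raw = [
    (date(2024, 1, 1), 20.0, "US"),
    (date(2024, 1, 6), 35.0, "DE"),
    (date(2024, 1, 7), 50.0, "US"),
    (date(2024, 1, 9), 15.0, "FR"),
]

# Dates -> day of week (Monday = 0 ... Sunday = 6).
day_of_week = np.array([d.weekday() for d, _, _ in raw])

# Numerical column -> zero mean, unit variance.
# In a real project, compute the mean and std on the training split only.
amounts = np.array([a for _, a, _ in raw])
amounts_scaled = (amounts - amounts.mean()) / amounts.std()

# Categorical column -> one-hot encoding, one column per category.
countries = sorted({c for _, _, c in raw})
one_hot = np.array([[float(c == k) for k in countries] for _, _, c in raw])

# Stack everything into one feature matrix: one row per record.
features = np.column_stack([day_of_week, amounts_scaled, one_hot])
print(countries)
print(features.shape)
```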

Training, Validation, and Test Splits

This is the most important concept beginners get wrong. You must evaluate your model on data it has never seen during training. Otherwise you are measuring memorization, not learning.

flowchart LR
    subgraph Dataset["Full Dataset (100%)"]
        direction LR
        TR["Training Set (70%)"]
        VA["Validation Set (15%)"]
        TE["Test Set (15%)"]
    end
    TR -->|Train model| M[Model]
    M -->|Tune hyperparameters| VA
    VA -->|Final evaluation| TE

| Split | Purpose | When used | Typical size |
| --- | --- | --- | --- |
| Training | Model learns from this data | During training | 60-80% |
| Validation | Tune hyperparameters, compare models | After each training run | 10-20% |
| Test | Final unbiased performance estimate | Once, at the very end | 10-20% |

The test set is sacred. You look at it exactly once. If you keep adjusting your model based on test performance, you are effectively training on the test set and your reported numbers are meaningless.

For small datasets, use k-fold cross-validation: split data into k parts, train on k-1 parts, validate on the remaining part, rotate, and average results.
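Both splitting schemes come down to shuffling indices and slicing them. The following is a sketch using NumPy only; the sample count, the seed, and the helper name `k_fold_indices` are assumptions, and libraries such as scikit-learn ship their own splitters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
indices = rng.permutation(n_samples)  # shuffle before splitting

# 70 / 15 / 15 split into train, validation, and test indices.
n_train = round(0.70 * n_samples)
n_val = round(0.15 * n_samples)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))


def k_fold_indices(n, k, rng):
    """Yield (train_idx, val_idx) pairs; each sample is validated on once."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val


# Rotate through 5 folds; a real loop would train and score a model here.
for fold_train, fold_val in k_fold_indices(n_samples, k=5, rng=rng):
    print(len(fold_train), len(fold_val))
```

In a k-fold loop you would train on `fold_train`, score on `fold_val`, and average the k scores.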

Overfitting vs Underfitting

flowchart LR
    subgraph UF["Underfitting"]
        U1["Model too simple"]
        U2["High bias"]
        U3["Misses patterns"]
    end
    subgraph GF["Good Fit"]
        G1["Right complexity"]
        G2["Balanced"]
        G3["Generalizes well"]
    end
    subgraph OF["Overfitting"]
        O1["Model too complex"]
        O2["High variance"]
        O3["Memorizes noise"]
    end
    UF -->|Increase complexity| GF
    GF -->|Too much complexity| OF

Underfitting: The model is too simple to capture the patterns in the data. A straight line trying to fit a curved relationship. Training error is high. Test error is high.

Overfitting: The model is too complex and memorizes the training data, including its noise. A wiggly curve that passes through every training point but fails on new data. Training error is low. Test error is high.

Good fit: The model captures real patterns without memorizing noise. Training error and test error are both reasonably low.
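The three regimes can be reproduced by fitting polynomials of increasing degree to noisy samples of a curve. This is a sketch under assumed settings: the sine target, the noise level, the sample sizes, and the degrees 1, 3, and 12 are all chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples from a curved (sine) relationship."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit on the training points only.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error always falls as the degree rises. Test error typically falls and then climbs again, and the gap between the two is the overfitting signal.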

Signs of overfitting:

Fixes for overfitting:

Fixes for underfitting:

The Bias-Variance Tradeoff

This is the mathematical framework behind overfitting and underfitting.

Bias: Error from wrong assumptions in the model. A linear model has high bias when the true relationship is nonlinear. High bias leads to underfitting.

Variance: Error from sensitivity to small fluctuations in the training data. A model with high variance gives very different predictions when trained on different subsets of data. High variance leads to overfitting.

| Model complexity | Bias | Variance | Result |
| --- | --- | --- | --- |
| Too low (linear model for curved data) | High | Low | Underfitting |
| Just right | Medium | Medium | Good generalization |
| Too high (degree-20 polynomial for 10 points) | Low | High | Overfitting |

For squared-error loss, the expected error on new data decomposes into three terms:

Total error = Bias^2 + Variance + Irreducible noise

You cannot reduce irreducible noise (it is randomness in the data itself). You want to find the sweet spot where bias^2 + variance is minimized.
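Bias and variance can be estimated numerically by refitting the same model on many noisy versions of a dataset and comparing the predictions. The sketch below assumes a known sine target, a fixed set of inputs with only the noise redrawn, and polynomial models of degree 1 and 7. None of these choices come from the lesson's code.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise_std = 0.3                      # irreducible noise level (assumed)
x_train = np.linspace(0.0, 1.0, 30)  # fixed inputs; only the noise is redrawn
x_grid = np.linspace(0.0, 1.0, 101)  # points where predictions are compared


def true_f(x):
    """The underlying relationship, known here because the data is synthetic."""
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_repeats=200):
    """Monte Carlo estimate of bias^2 and variance for a polynomial fit."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_train = true_f(x_train) + rng.normal(scale=noise_std, size=x_train.size)
        preds[r] = Polynomial.fit(x_train, y_train, deg=degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


results = {d: bias_variance(d) for d in (1, 7)}
for d, (b, v) in results.items():
    total = b + v + noise_std ** 2
    print(f"degree {d}: bias^2 {b:.4f}, variance {v:.4f}, total {total:.4f}")
```

The linear model carries most of its error as bias, while the degree-7 model trades that bias for higher variance.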

No Free Lunch Theorem

There is no single algorithm that works best for every problem. Averaged over all possible problems, every algorithm performs equally well, so an algorithm that excels on one class of problems must do worse on another. This is why data scientists try multiple algorithms and compare results.

In practice, the choice depends on:

When NOT to Use Machine Learning

ML is powerful but not always the right tool. Before reaching for a model, ask whether you actually need one.

Do not use ML when:

Use this decision flowchart:

flowchart TD
    A["Do you have data?"] -->|No| B["Collect data first or use rules"]
    A -->|Yes| C["Can you write the rules explicitly?"]
    C -->|"Yes, and they are simple"| D["Use rules. Skip ML."]
    C -->|"No, or they are too complex"| E["Is the cost of errors acceptable?"]
    E -->|"No, need guaranteed correctness"| F["Use deterministic methods"]
    E -->|Yes| G["Do you need explainability?"]
    G -->|"Yes, strictly"| H["Use interpretable models only"]
    G -->|"No, or partially"| I["Use ML"]
    I --> J["Do you have enough labeled data?"]
    J -->|Yes| K["Supervised learning"]
    J -->|"Some labels"| L["Semi-supervised learning"]
    J -->|"No labels"| M["Unsupervised or self-supervised"]
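The same flowchart can be read as a chain of early returns. The function name `choose_approach` and its argument names are hypothetical; the returned strings are the leaf nodes of the diagram.

```python
def choose_approach(has_data, rules_are_simple, errors_acceptable,
                    needs_strict_explainability, labels):
    """Walk the flowchart from top to bottom.

    `labels` describes the labeled data available: "enough", "some", or "none".
    """
    if not has_data:
        return "Collect data first or use rules"
    if rules_are_simple:
        return "Use rules. Skip ML."
    if not errors_acceptable:
        return "Use deterministic methods"
    if needs_strict_explainability:
        return "Use interpretable models only"
    # At this point ML is appropriate; pick the learning type by label supply.
    return {
        "enough": "Supervised learning",
        "some": "Semi-supervised learning",
        "none": "Unsupervised or self-supervised",
    }[labels]


print(choose_approach(True, False, True, False, "some"))
```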

Build It

The code in code/ml_intro.py implements a nearest centroid classifier from scratch, one of the simplest ML algorithms. It demonstrates the core idea: learn from data, then predict on new data.

Step 1: Nearest Centroid Classifier from Scratch

The nearest centroid classifier computes the center (mean) of each class in the training data. To predict, it assigns each new point to the class whose center is closest.

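import numpy as np


class NearestCentroid:
    def fit(self, X, y):
        # One centroid per class: the mean of that class's training points.
        self.classes = np.unique(y)
        self.centroids = np.array([
            X[y == c].mean(axis=0) for c in self.classes
        ])
        return self

    def predict(self, X):
        # distances[i, j] = Euclidean distance from centroid i to sample j.
        distances = np.array([
            np.sqrt(((X - c) ** 2).sum(axis=1))
            for c in self.centroids
        ])
        # For each sample, pick the class whose centroid is nearest.
        return self.classes[distances.argmin(axis=0)]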

That is the entire algorithm. Fit computes one mean per class (two in this example). Predict computes distances. No gradient descent, no iteration, no hyperparameters.

Step 2: Train on Synthetic Data

We generate a 2D classification dataset with two classes that overlap slightly. The centroid classifier draws a linear decision boundary between the class centers.

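import numpy as np

rng = np.random.RandomState(42)  # fixed seed so the data is reproducible
# Class 0 is centered at (1, 1), class 1 at (-1, -1); both have unit variance.
X_class0 = rng.randn(100, 2) + np.array([1.0, 1.0])
X_class1 = rng.randn(100, 2) + np.array([-1.0, -1.0])
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 100 + [1] * 100)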

Step 3: Compare Against a Baseline

Every ML model should be compared against a trivial baseline. Here, the baseline predicts a random class for each example in a held-out test set, whose labels are stored in y_test. If your ML model does not beat random guessing, something is wrong.

baseline_preds = rng.choice([0, 1], size=len(y_test))
baseline_acc = np.mean(baseline_preds == y_test)

The centroid classifier should reach roughly 90% accuracy on this dataset. The two classes overlap slightly, so 100% is not attainable. The random baseline lands near 50%.
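Steps 1 to 3 can be put together in one script. The snippets above do not show how `y_test` is produced, so the shuffle-and-split below is an assumption (a 70/30 split chosen for illustration), not necessarily what code/ml_intro.py does. The classifier is inlined so the block runs on its own.

```python
import numpy as np

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(100, 2) + [1.0, 1.0],
               rng.randn(100, 2) + [-1.0, -1.0]])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 30% for testing (ratio assumed for illustration).
order = rng.permutation(len(y))
n_train = round(0.7 * len(y))
train_idx, test_idx = order[:n_train], order[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Fit: one centroid per class, computed from the training split only.
classes = np.unique(y_train)
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])

# Predict: assign each test point to the nearest centroid.
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
test_preds = classes[dists.argmin(axis=1)]
test_acc = np.mean(test_preds == y_test)

# Trivial baseline: guess a class uniformly at random.
baseline_preds = rng.choice([0, 1], size=len(y_test))
baseline_acc = np.mean(baseline_preds == y_test)

print(f"centroid: {test_acc:.3f}, random baseline: {baseline_acc:.3f}")
```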

Why This Matters

The nearest centroid classifier is trivially simple. It has no hyperparameters, no iteration, no gradient descent. Yet it captures the fundamental ML pattern:

  1. Learn a representation from training data (the centroids)
  2. Predict on new data using that representation (nearest distance)
  3. Evaluate against a baseline (random guessing)

Every ML algorithm, from logistic regression to transformers, follows this same three-step pattern. The representation gets more complex, but the workflow stays the same.

Step 4: What the Centroid Classifier Cannot Do

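The nearest centroid classifier assumes each class forms a single blob, and it draws linear decision boundaries. It fails when a class is made up of several separate clusters, when the true boundary between classes is nonlinear, or when features are measured on very different scales.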

These limitations motivate every other algorithm you will learn. K-nearest neighbors handles multiple clusters. Decision trees handle nonlinear boundaries. Feature scaling fixes the scale problem. Each lesson builds on the limitations of the previous one.
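The multiple-cluster failure is easy to reproduce. In the sketch below each class consists of two blobs arranged in an XOR pattern, so both class centroids land near the origin. The layout, the cluster spread, and the 1-nearest-neighbor comparison are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_xor(n_per_cluster):
    """Two clusters per class, arranged like XOR (an illustrative layout)."""
    centers = [((2, 2), 0), ((-2, -2), 0), ((2, -2), 1), ((-2, 2), 1)]
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_cluster, 2))
                   for c, _ in centers])
    y = np.repeat([label for _, label in centers], n_per_cluster)
    return X, y


X_train, y_train = make_xor(100)
X_test, y_test = make_xor(50)

# Nearest centroid: both class means land near the origin.
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
d_centroid = np.linalg.norm(X_test[:, None] - centroids[None], axis=2)
centroid_acc = np.mean(d_centroid.argmin(axis=1) == y_test)

# 1-nearest neighbor: copy the label of the closest training point.
d_train = np.linalg.norm(X_test[:, None] - X_train[None], axis=2)
knn_acc = np.mean(y_train[d_train.argmin(axis=1)] == y_test)

print(f"nearest centroid: {centroid_acc:.2f}, 1-NN: {knn_acc:.2f}")
```

The centroid classifier does little better than guessing here, while a method that looks at individual neighbors separates the clusters easily.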

Use It

sklearn provides NearestCentroid and synthetic data generators:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

# Two informative features, one cluster per class.
X, y = make_classification(
    n_samples=500, n_features=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42
)
# Hold out 30% for testing; fix the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = NearestCentroid()
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.3f}")

Ship It

This lesson produces outputs/prompt-ml-problem-framer.md -- a prompt that turns vague business problems into concrete ML tasks. Give it a problem description ("we want to reduce churn" or "predict demand for next quarter") and it identifies the learning type, defines the prediction target, lists candidate features, picks a success metric, establishes a baseline, and flags pitfalls like data leakage or class imbalance. Use it at the start of any ML project to avoid building the wrong thing.

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Model | "The AI" | A mathematical function with learnable parameters that maps inputs to outputs |
| Training | "Teaching the AI" | Running an optimization algorithm to adjust model parameters so predictions match known outputs |
| Feature | "An input column" | A measurable property of the data that the model uses to make predictions |
| Label | "The answer" | The known output for a training example, used to compute the error signal |
| Hyperparameter | "A setting you tweak" | A parameter set before training that controls the learning process (learning rate, number of layers) |
| Loss function | "How wrong the model is" | A function that measures the gap between predicted and actual outputs, which training tries to minimize |
| Overfitting | "It memorized the test" | The model learned training-specific noise instead of general patterns, so it fails on new data |
| Underfitting | "It didn't learn anything" | The model is too simple to capture the real patterns in the data |
| Generalization | "It works on new data" | The model's ability to make accurate predictions on data it was not trained on |
| Cross-validation | "Testing on different chunks" | Repeatedly splitting data into train/test folds and averaging results, giving a more robust performance estimate |
| Regularization | "Keeping weights small" | Adding a penalty term to the loss function that discourages overly complex models |
| Data drift | "The world changed" | The statistical distribution of incoming data shifts over time, degrading model performance |

Exercises

  1. Take any dataset (e.g., Iris, Titanic). Split it 70/15/15 into train/validation/test. Explain why you should not tune hyperparameters on the test set.
  2. List three real-world problems. For each one, identify whether it is classification, regression, or clustering, and whether it is supervised or unsupervised.
  3. A model gets 99% accuracy on training data but 60% on test data. Diagnose the problem and list three things you would try to fix it.

Further Reading