Time Series Fundamentals

> Past performance does predict future results -- if you check for stationarity first.

Type: Build

Language: Python

Prerequisites: Phase 2, Lessons 01-09

Time: ~90 minutes

Learning Objectives

The Problem

You have data ordered by time. Daily sales, hourly temperature, per-minute CPU usage, weekly stock prices. You want to predict the next value, the next week, the next quarter.

You reach for your standard ML toolkit: random train/test split, cross-validation, feature matrix in, prediction out. Every step is wrong.

Time series breaks the assumptions that standard ML relies on. Samples are not independent -- today's temperature depends on yesterday's. Random splits leak future information into the past. Features that look great in backtest fail in production because they rely on patterns that shift over time.

A model that gets 95% accuracy with random cross-validation might get 55% with proper time-based evaluation. The difference is not a technicality. It is the difference between a model that works on paper and one that works in production.

This lesson covers the fundamentals: what makes time data different, how to evaluate models honestly, and how to turn a time series into features that standard ML models can consume.

The Concept

What Makes Time Series Different

Standard ML assumes i.i.d. -- independent and identically distributed. Each sample is drawn from the same distribution, independently of other samples. Time series violates both:

These violations are not minor. They change how you build features, how you evaluate models, and which algorithms work.

flowchart LR subgraph IID["Standard ML (i.i.d.)"] direction TB S1[Sample 1] ~~~ S2[Sample 2] S2 ~~~ S3[Sample 3] end subgraph TS["Time Series (not i.i.d.)"] direction LR T1[t=1] --> T2[t=2] T2 --> T3[t=3] T3 --> T4[t=4] end style S1 fill:#dfd style S2 fill:#dfd style S3 fill:#dfd style T1 fill:#ffd style T2 fill:#ffd style T3 fill:#ffd style T4 fill:#ffd

In standard ML, samples are interchangeable. Shuffling them changes nothing. In time series, order is everything. Shuffling destroys the signal.

Components of a Time Series

Every time series is a combination of:

flowchart TD A[Observed Time Series] --> B[Trend] A --> C[Seasonality] A --> D[Residual/Noise] B --> E[Long-term direction: up, down, flat] C --> F[Repeating patterns: daily, weekly, yearly] D --> G[Random variation after removing trend and seasonality]

Stationarity

A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Most forecasting methods assume stationarity.

Why it matters: A non-stationary series has a mean that drifts. A model trained on data from January has learned a different mean than what February will show. It will be systematically wrong.

How to check: Compute rolling mean and rolling standard deviation over windows. If they drift, the series is non-stationary.

How to fix: Differencing. Instead of modeling the raw values, model the change between consecutive values:

diff[t] = value[t] - value[t-1]

If one round of differencing does not make the series stationary, apply it again (second-order differencing). Most real-world series need at most two rounds.

Example:

Original series: [100, 102, 106, 112, 120]

First difference: [2, 4, 6, 8] (still trending upward)

Second difference: [2, 2, 2] (constant -- stationary)

The original series had a quadratic trend. First differencing turned it into a linear trend. Second differencing made it flat. In practice, you rarely need more than two rounds.

Formal test: The Augmented Dickey-Fuller (ADF) test is the standard statistical test for stationarity. The null hypothesis is "the series is non-stationary." A p-value below 0.05 means you can reject the null and conclude stationarity. We do not implement ADF from scratch (it requires asymptotic distribution tables), but the rolling statistics approach in our code gives a practical visual check.

Autocorrelation

Autocorrelation measures how much a value at time t correlates with the value at time t-k (k steps in the past). The autocorrelation function (ACF) plots this correlation for each lag k.

ACF tells you:

PACF (Partial Autocorrelation Function) removes indirect correlations. If today correlates with 3 days ago only because both correlate with yesterday, PACF at lag 3 will be zero while ACF at lag 3 will not.

Lag Features: Turning Time Series into Supervised Learning

Standard ML models need a feature matrix X and a target y. Time series gives you a single column of values. The bridge is lag features.

Take the series [10, 12, 14, 13, 15] and create lag-1 and lag-2 features:

lag_2 lag_1 target
10 12 14
12 14 13
14 13 15

Now you have a standard regression problem. Any ML model (linear regression, random forest, gradient boosting) can predict the target from the lags.

Additional features you can engineer:

How many lags? Use the autocorrelation function. If ACF is significant up to lag 10, use at least 10 lags. If there is weekly seasonality, include lag 7 (and possibly 14). More lags give the model more history but also more features to fit, increasing the risk of overfitting.

The target alignment trap. When creating lag features, the target must be the value at time t, and all features must use values at time t-1 or earlier. If you accidentally include the value at time t as a feature, you have a perfect predictor -- and a completely useless model. This is the most common bug in time series feature engineering.

Walk-Forward Validation

This is the most important concept in this lesson. Standard k-fold cross-validation randomly assigns samples to train and test. For time series, this leaks future information.

flowchart TD subgraph WRONG["Random Split (WRONG)"] direction LR W1[Jan] --> W2[Mar] W2 --> W3[Feb] W3 --> W4[May] W4 --> W5[Apr] style W1 fill:#fdd style W3 fill:#fdd style W5 fill:#fdd style W2 fill:#dfd style W4 fill:#dfd end subgraph RIGHT["Walk-Forward (CORRECT)"] direction LR R1["Train: Jan-Mar"] --> R2["Test: Apr"] R3["Train: Jan-Apr"] --> R4["Test: May"] R5["Train: Jan-May"] --> R6["Test: Jun"] style R1 fill:#dfd style R2 fill:#fdd style R3 fill:#dfd style R4 fill:#fdd style R5 fill:#dfd style R6 fill:#fdd end

Walk-forward validation:

  1. Train on data up to time t
  2. Predict at time t+1 (or t+1 to t+k for multi-step)
  3. Slide the window forward
  4. Repeat

Each test fold only contains data that comes after all training data. No future leakage. This gives you an honest estimate of how the model will perform when deployed.

Expanding window uses all historical data for training (window grows). Sliding window uses a fixed-size training window (window slides). Use expanding when you believe older data is still relevant. Use sliding when the world changes and old data hurts.

ARIMA Intuition

ARIMA is the classical time series model. It has three components:

ARIMA(p, d, q) combines all three. You choose p, d, q based on ACF/PACF analysis or automated search (auto-ARIMA).

We will not implement ARIMA from scratch -- it requires numerical optimization that is beyond the scope of this lesson. The key insight is understanding what each component does so you can interpret ARIMA results and know when to use it.

When to Use What

Approach Best For Handles Seasonality Handles External Features
Lag features + ML Tabular with many external features With calendar features Yes
ARIMA Single univariate series, short-term SARIMA variant No (ARIMAX for limited)
Exponential smoothing Simple trend + seasonality Yes (Holt-Winters) No
Prophet Business forecasting, holidays Yes (Fourier terms) Limited
Neural networks (LSTM, Transformer) Long sequences, many series Learned Yes

For most practical problems, lag features + gradient boosting is the strongest starting point. It handles external features naturally, does not require stationarity, and is easy to debug.

Forecasting Horizons and Strategies

Single-step forecasting predicts one time step ahead. Multi-step forecasting predicts multiple steps. There are three strategies:

Recursive (iterated): Predict one step ahead, use the prediction as input for the next step. Simple but errors accumulate -- each prediction uses the previous prediction, so mistakes compound.

Direct: Train a separate model for each horizon. Model-1 predicts t+1, Model-5 predicts t+5. No error accumulation, but each model has fewer training samples and they do not share information.

Multi-output: Train one model that outputs all horizons simultaneously. Shares information across horizons but requires a model that supports multiple outputs (or a custom loss function).

For most practical problems, start with recursive for short horizons (1-5 steps) and direct for longer horizons.

Common Mistakes in Time Series

Mistake Why it happens How to fix
Random train/test split Habit from standard ML Use walk-forward or temporal split
Using future features Feature at time t included by mistake Audit every feature for temporal alignment
Overfitting to seasonality Model memorizes calendar patterns Hold out a full seasonal cycle in the test set
Ignoring scale changes Revenue doubles but patterns stay Model percentage change instead of absolute
Too many lag features "More history is better" Use ACF to determine relevant lags
Not differencing "The model will figure it out" Tree models handle trends; linear models need stationarity

Build It

The code in code/time_series.py implements the core building blocks from scratch.

Lag Feature Creator

def make_lag_features(series, n_lags):
    n = len(series)
    X = np.full((n, n_lags), np.nan)
    for lag in range(1, n_lags + 1):
        X[lag:, lag - 1] = series[:-lag]
    valid = ~np.isnan(X).any(axis=1)
    return X[valid], series[valid]

This converts a 1D series into a feature matrix where each row has the last n_lags values as features, and the current value as the target.

Walk-Forward Cross-Validation

def walk_forward_split(n_samples, n_splits=5, min_train=50):
    assert min_train < n_samples, "min_train must be less than n_samples"
    step = max(1, (n_samples - min_train) // n_splits)
    for i in range(n_splits):
        train_end = min_train + i * step
        test_end = min(train_end + step, n_samples)
        if train_end >= n_samples:
            break
        yield slice(0, train_end), slice(train_end, test_end)

Each split ensures training data comes strictly before test data. The training window expands with each fold.

Simple Autoregressive Model

A pure AR model is just linear regression on lag features:

class SimpleAR:
    def __init__(self, n_lags=5):
        self.n_lags = n_lags
        self.weights = None
        self.bias = None

    def fit(self, series):
        X, y = make_lag_features(series, self.n_lags)
        # Solve via normal equations
        X_b = np.column_stack([np.ones(len(X)), X])
        theta = np.linalg.lstsq(X_b, y, rcond=None)[0]
        self.bias = theta[0]
        self.weights = theta[1:]
        return self

This is conceptually identical to linear regression from Lesson 02, but applied to time-lagged versions of the same variable.

Stationarity Check

The code computes rolling statistics to visually and numerically assess stationarity:

def check_stationarity(series, window=50):
    rolling_mean = np.array([
        series[max(0, i - window):i].mean()
        for i in range(1, len(series) + 1)
    ])
    rolling_std = np.array([
        series[max(0, i - window):i].std()
        for i in range(1, len(series) + 1)
    ])
    return rolling_mean, rolling_std

If the rolling mean drifts or the rolling std changes, the series is non-stationary. Apply differencing and check again.

The code also checks stationarity by comparing the first half and second half of the series. If the means differ by more than half a standard deviation or the variance ratio exceeds 2x, the series is flagged as non-stationary.

Autocorrelation

def autocorrelation(series, max_lag=20):
    n = len(series)
    mean = series.mean()
    var = series.var()
    acf = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        cov = np.mean((series[:n-k] - mean) * (series[k:] - mean))
        acf[k] = cov / var if var > 0 else 0
    return acf

Use It

With sklearn, you use lag features directly with any regressor:

from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_lag_features(series, n_lags=10)

for train_idx, test_idx in walk_forward_split(len(X)):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])

For ARIMA, use statsmodels:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train_series, order=(5, 1, 2))
fitted = model.fit()
forecast = fitted.forecast(steps=30)

The code in time_series.py demonstrates both approaches and compares them using walk-forward validation.

sklearn TimeSeriesSplit

sklearn provides TimeSeriesSplit which implements walk-forward validation:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)

This is equivalent to our from-scratch walk_forward_split but integrated into sklearn's cross-validation framework. You can use it with cross_val_score:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
print(f"Mean score: {scores.mean():.4f} +/- {scores.std():.4f}")

Evaluation Metrics

Time series forecasting uses regression metrics, but with time-aware context:

Rolling Features

The code demonstrates adding rolling statistics (mean, std, min, max over windows of 7 and 14 days) to lag features. These give the model information about recent trends and volatility that lag features alone do not capture.

For example, if the rolling mean is rising, it suggests an upward trend. If the rolling std is increasing, it suggests growing volatility. These are the kinds of patterns that tree-based models can learn from but linear models cannot.

Ship It

This lesson produces:

Baselines You Must Beat

Before building any model, establish baselines:

  1. Last value (persistence). Predict that tomorrow will be the same as today. For many series, this is surprisingly hard to beat.
  2. Seasonal naive. Predict that today will be the same as the same day last week (or last year). If your model cannot beat this, it has not learned any useful pattern beyond seasonality.
  3. Moving average. Predict the average of the last k values. Smooths noise but cannot capture sudden changes.

If your fancy ML model loses to the seasonal naive baseline, you have a bug. Most commonly: future leakage in features, wrong evaluation method, or the series is truly random and unpredictable.

Practical Tips

  1. Start with plotting. Before any modeling, plot the raw series. Look for trends, seasonality, outliers, structural breaks (sudden changes in behavior). A 30-second visual inspection often tells you more than an hour of automated analysis.
  1. Difference first, model second. If the series has a clear trend, difference it before creating lag features. Tree-based models can handle trends, but linear models cannot, and differencing never hurts.
  1. Hold out at least one full seasonal cycle. If you have weekly seasonality, your test set needs at least one full week. If monthly, at least one full month. Otherwise you cannot evaluate whether the model captured the seasonal pattern.
  1. Monitor in production. Time series models degrade over time as the world changes. Track prediction errors on a rolling basis. When errors start increasing, retrain the model on recent data.
  1. Beware of regime changes. A model trained on pre-pandemic data will not predict post-pandemic behavior. Include indicators of known regime changes as features, or use a sliding window that forgets old data.
  1. Log-transform skewed series. Revenue, prices, and counts are often right-skewed. Taking the log stabilizes variance and makes multiplicative patterns additive, which linear models can handle. Forecast in log space, then exponentiate to get back to original units.

Exercises

  1. Stationarity experiment. Generate a series with a linear trend. Check stationarity with rolling statistics. Apply first differencing. Check again. How many rounds of differencing does it take for a quadratic trend?
  1. Lag selection. Compute ACF on a seasonal series (period=7). Which lags have the highest autocorrelation? Create lag features using only those lags (not consecutive lags). Does accuracy improve compared to using lags 1 through 7?
  1. Walk-forward vs random split. Train a Ridge regression on lag features. Evaluate with random 80/20 split and with walk-forward validation. How much does the random split overestimate performance?
  1. Feature engineering. Add rolling mean (window=7), rolling std (window=7), and day-of-week features to the lag features. Compare accuracy with and without these extras using walk-forward validation.
  1. Multi-step forecasting. Modify the AR model to predict 5 steps ahead instead of 1. Compare two strategies: (a) predict one step, use the prediction as input for the next step (recursive), and (b) train separate models for each horizon (direct). Which is more accurate?

Key Terms

Term What people say What it actually means
Stationarity "The stats don't change over time" A series whose mean, variance, and autocorrelation structure are constant over time
Differencing "Subtract consecutive values" Computing y[t] - y[t-1] to remove trends and achieve stationarity
Autocorrelation (ACF) "How a series correlates with itself" The correlation between a time series and a lagged copy of itself, as a function of the lag
Partial autocorrelation (PACF) "Direct correlation only" Autocorrelation at lag k after removing the effect of all shorter lags
Lag features "Past values as inputs" Using y[t-1], y[t-2], ..., y[t-k] as features to predict y[t]
Walk-forward validation "Time-respecting cross-validation" Evaluation where training data always precedes test data chronologically
ARIMA "The classic time series model" AutoRegressive Integrated Moving Average: combines past values (AR), differencing (I), and past errors (MA)
Seasonality "Repeating calendar patterns" Regular, predictable cycles in a time series tied to calendar periods (daily, weekly, yearly)
Trend "The long-term direction" A persistent increase or decrease in the series level over time
Expanding window "Use all history" Walk-forward validation where the training set grows with each fold
Sliding window "Fixed-size history" Walk-forward validation where the training set is a fixed-length window that slides forward

Further Reading