GPU Setup & Cloud
> Training on CPU is fine for learning. Training for real needs a GPU.
Type: Build
Languages: Python
Prerequisites: Phase 0, Lesson 01
Time: ~45 minutes
Learning Objectives
- Verify local GPU availability using `nvidia-smi` and PyTorch's CUDA API
- Configure Google Colab with a T4 GPU for free cloud-based experiments
- Benchmark matrix multiplication on CPU vs GPU and measure the speedup
- Estimate the largest model that fits in your VRAM using the fp16 rule of thumb
The Problem
Most lessons in Phases 1-3 run fine on CPU. Once you start training CNNs, transformers, or LLMs (Phase 4 onward), you need GPU acceleration: a training run that takes 8 hours on CPU can finish in roughly 10 minutes on a GPU, depending on the model and the hardware.
You have three options: local GPU, cloud GPU, or Google Colab (free).
The Concept
Your options:

| Option | Cost | Setup | Best for |
|---|---|---|---|
| 1. Local NVIDIA GPU | $0 (you already have it) | Install CUDA + cuDNN | Regular use, large datasets |
| 2. Google Colab (free tier) | $0 | None | Quick experiments, no GPU at home |
| 3. Cloud GPU (Lambda, RunPod, Vast.ai) | $0.20-2.00/hr | SSH + install | Serious training, large models |
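To put the hourly cloud rates above in perspective, here is a minimal sketch of the arithmetic for one run. The 10-hour duration and the `run_cost` helper are hypothetical, chosen only for illustration; the two rates are the ends of the range listed above.

```python
def run_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost in dollars of renting one GPU for `hours` hours."""
    return hours * rate_per_hour


hours = 10  # hypothetical training run length
low = run_cost(hours, 0.20)   # cheapest rate in the listed range
high = run_cost(hours, 2.00)  # most expensive rate in the listed range
print(f"{hours} h costs between ${low:.2f} and ${high:.2f}")
```

Real bills may also include storage and data transfer, which this sketch ignores.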
Build It
Option 1: Local NVIDIA GPU
Check if you have one:
nvidia-smi
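If you would rather run the same check from a script, the sketch below calls `nvidia-smi` through Python's standard library. The `query_gpus` helper is a hypothetical name, and the sketch assumes `nvidia-smi` is on your `PATH` when an NVIDIA driver is installed; it returns `None` instead of raising when the tool is missing or fails.

```python
import shutil
import subprocess


def query_gpus():
    """Return a list of (name, total_memory) tuples, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # tool present, but no usable GPU or driver
    # Each output line holds the GPU name and its total memory, comma-separated
    return [
        tuple(field.strip() for field in line.split(","))
        for line in result.stdout.strip().splitlines()
    ]


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```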
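Install PyTorch with CUDA support, then verify that PyTorch can see the GPU:
import torch
# True only if a CUDA-enabled build of PyTorch finds a usable GPU
print(f"CUDA available: {torch.cuda.is_available()}")
# None on CPU-only builds of PyTorch
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")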
Option 2: Google Colab
- Go to colab.research.google.com
- Runtime > Change runtime type > T4 GPU
- Run `!nvidia-smi` to verify
Upload notebooks from this course directly to Colab.
Option 3: Cloud GPU
For Lambda Labs, RunPod, or Vast.ai:
ssh user@your-gpu-instance
pip install torch torchvision torchaudio
python -c "import torch; print(torch.cuda.get_device_name(0))"
No GPU? No problem.
Most lessons work on CPU. The ones that need a GPU say so and include Colab links.
import torch
# Fall back to CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
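Selecting a device only helps if the model and its inputs are both placed on it. The sketch below shows one way to do that; the `nn.Linear` layer and the tensor shapes are hypothetical stand-ins for a real model and batch.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy model, used only to show device placement
model = nn.Linear(4, 2).to(device)

# Create the input directly on the same device as the model
x = torch.randn(8, 4, device=device)

with torch.no_grad():  # inference only, no gradients needed
    y = model(x)

print(y.shape, y.device.type)
```

If the model and the input live on different devices, PyTorch raises a runtime error, so the same script runs unchanged on CPU and GPU as long as both follow `device`.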
Build It: GPU vs CPU benchmark
import time

import torch

size = 5000
a_cpu = torch.randn(size, size)
b_cpu = torch.randn(size, size)

# Time the matrix multiplication on CPU
start = time.perf_counter()
c_cpu = a_cpu @ b_cpu
cpu_time = time.perf_counter() - start
print(f"CPU: {cpu_time:.3f}s")

if torch.cuda.is_available():
    a_gpu = a_cpu.to("cuda")
    b_gpu = b_cpu.to("cuda")

    # Warm-up run so one-time CUDA initialization is not counted
    _ = a_gpu @ b_gpu
    # CUDA kernels launch asynchronously; wait for them before reading the clock
    torch.cuda.synchronize()

    start = time.perf_counter()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_time = time.perf_counter() - start
    print(f"GPU: {gpu_time:.3f}s")
    print(f"Speedup: {cpu_time / gpu_time:.0f}x")
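A single timed run can be noisy, so it may help to repeat the measurement and report the median. The helper below is a minimal sketch of that idea, not part of PyTorch: `time_fn`, its `sync` hook, and the `warmup` and `repeats` defaults are all assumptions made for illustration. It uses only the standard library, so the demo runs without a GPU.

```python
import statistics
import time


def time_fn(fn, sync=None, warmup=2, repeats=5):
    """Return the median wall-clock time of fn() in seconds."""
    # Warm-up calls absorb one-time setup costs
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def work():
    # Hypothetical CPU-bound workload for the demo
    return sum(i * i for i in range(100_000))


median_s = time_fn(work)
print(f"median: {median_s * 1e3:.2f} ms")
```

For the GPU case, one would pass something like `time_fn(lambda: a_gpu @ b_gpu, sync=torch.cuda.synchronize)` so that each sample waits for the kernel to finish.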
Exercises
- Run the benchmark above and compare CPU vs GPU times
- If you don't have a GPU, run it on Google Colab and compare
- Check how much GPU memory you have and estimate the largest model you can fit (rule of thumb: 2 bytes per parameter for fp16)
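The rule of thumb in the last exercise can be sketched as a short calculation. The `max_params` helper and the 16 GB card are hypothetical; substitute the memory your own GPU reports. Gigabytes are taken as 1e9 bytes, matching the memory check earlier in this lesson.

```python
def max_params(vram_gb: float, bytes_per_param: int = 2) -> int:
    """Largest parameter count whose weights alone fit in `vram_gb` gigabytes."""
    return int(vram_gb * 1e9 / bytes_per_param)


vram_gb = 16  # hypothetical card size
fp16_params = max_params(vram_gb)                     # 2 bytes per parameter
fp32_params = max_params(vram_gb, bytes_per_param=4)  # 4 bytes per parameter
print(f"fp16: ~{fp16_params / 1e9:.0f}B parameters")
print(f"fp32: ~{fp32_params / 1e9:.0f}B parameters")
```

This counts the weights only. Training also needs memory for gradients, optimizer state, and activations, so the practical limit is lower than this estimate.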
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| CUDA | "GPU programming" | NVIDIA's parallel computing platform that lets you run code on the GPU |
| VRAM | "GPU memory" | Video RAM on the GPU, separate from system RAM. Limits model size. |
| fp16 | "Half precision" | 16-bit floating point, uses half the memory of fp32 with minimal accuracy loss |
| Tensor Core | "Fast matrix hardware" | Specialized GPU cores for matrix multiplication, roughly 4-8x faster than regular CUDA cores on supported precisions |