Machine Learning in 15 Minutes: A Crash Course for Engineers Who Build Things
You already know systems. Let me map ML onto what you know.
I’m a self-taught senior staff engineer. I’ve built mobile apps, full-stack web platforms, motion capture software, recording studios. I’ve never had a machine learning class. And I just fine-tuned a 3-billion-parameter language model on my own hardware to do one specific job better than models 20x its size.
This is the crash course I wish someone had given me. No math prerequisites. No PhD gatekeeping. Just the concepts, mapped onto things you already understand.
If you’ve ever deployed a web server, you already have the mental models for all of this.
A Neural Network Is a Function With Knobs
You know f(x) = y. A neural network is f(x, weights) = y, where weights are millions of numbers that determine the function’s behavior. Training is the process of turning the knobs until f produces the output you want for the inputs you have.
That’s it.
Everything else in ML is details about which knobs, how many, and how to turn them.
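To make the knob metaphor concrete, here's a minimal sketch: a one-line "network" with two knobs, where "training" is just someone having set the knobs to the right values (the numbers here are made up for illustration).

```python
# A "neural network" in miniature: f(x, weights) = y.
# Two knobs (w, b); training means turning them until f(x) matches 2x + 1.
def f(x, w, b):
    return w * x + b

# Untrained knobs: arbitrary starting values
w, b = 0.0, 0.0
print(f(3, w, b))   # 0.0 — wrong answer

# "Trained" knobs (imagine a training loop set these)
w, b = 2.0, 1.0
print(f(3, w, b))   # 7.0 — matches 2*3 + 1
```

A real model is the same shape, just with millions of knobs instead of two.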
Training Is Compilation. Inference Is Runtime.
When you compile a TypeScript project, you take source (your code) and produce an artifact (the bundle) that executes later. The compiler is slow and expensive. The compiled program is fast and cheap to run. You never ship the compiler to production. You ship the binary.
Training works the same way. You take source (training data) and produce an artifact (model weights) through an expensive process (the training loop). Then you deploy the weights and run inference — fast, cheap, repeatable.
Fine-tuning is recompiling with a patch. You don’t rebuild from source. You link in new object files on top of the existing binary. In ML, those object files are called LoRA adapters, and I’ll explain them in a minute.
Parameters Are Memory. Layers Are Your Call Stack.
A 3B model has 3 billion tunable numbers. Think of it as 3 billion bytes of learned memory. A 70B model has 70 billion. More memory means more capacity to store patterns, relationships, and facts.
But it also stores irrelevant patterns.
If 60 billion of those 70 billion parameters encode knowledge about cooking recipes, medieval history, and JavaScript trivia, they’re dead weight for your specific task. A 3B model trained exclusively on your domain uses 100% of its capacity on what matters.
It’s the difference between a 64GB phone running one app versus a 512GB phone running 200 apps. The dedicated device wins at its one job.
Layers work like a call stack. Data enters at layer 1, gets transformed, gets passed to layer 2, gets transformed again, and so on through 30-40 layers. Each layer is a function that takes the output of the previous layer and produces input for the next one. Early layers learn low-level patterns — syntax, token relationships. Deep layers learn high-level abstractions — meaning, intent, structure.
It’s the same concept as middleware in Express. Each layer processes and passes forward.
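As a sketch of that middleware chain, here are three toy "layers" operating on a plain number (real layers operate on tensors, and these transformations are made up for illustration):

```python
# Layers as a middleware chain: each function transforms and passes forward.
def layer1(x):          # low-level pattern: scale
    return x * 2

def layer2(x):          # mid-level pattern: shift
    return x + 3

def layer3(x):          # high-level pattern: clamp (like a ReLU activation)
    return max(0, x)

def forward(x, layers):
    for layer in layers:
        x = layer(x)    # output of one layer is the input of the next
    return x

print(forward(5, [layer1, layer2, layer3]))  # 13
```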
Tokens Are Not Words. They’re Like UTF-8 Codepoints.
You know how UTF-8 encodes characters as 1-4 bytes, and some characters take multiple bytes? Tokenization does the same thing to language.
“CreateUserHandler” might become three tokens: ["Create", "User", "Handler"]. Common words get one token. Rare words get split into pieces. The model never sees text. It sees a sequence of integer IDs, each mapping to a token in its vocabulary.
Context length — like 4096 or 128K tokens — is your buffer size. Longer context means more RAM per request. Everything the model can “see” at once has to fit in that buffer.
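Here's a toy greedy tokenizer to show the text-to-integer-IDs step. The vocabulary and IDs are made up; real tokenizers use BPE with vocabularies of tens of thousands of pieces, but the longest-match flavor is similar.

```python
# A toy greedy tokenizer over a made-up vocabulary — not real BPE,
# just the idea: text in, sequence of integer IDs out.
vocab = {"Create": 0, "User": 1, "Handler": 2, "Get": 3, "Id": 4}

def tokenize(text):
    ids = []
    while text:
        # prefer the longest matching piece, like BPE's merge preference
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token for: {text!r}")
    return ids

print(tokenize("CreateUserHandler"))  # [0, 1, 2]
```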
Attention Is a Database Query
The transformer architecture — what every modern LLM uses — has one core operation: attention.
Imagine every token in the input is a row in a database table. When the model processes a new token, it runs a query: “Which previous tokens are relevant to what I’m generating right now?”
Attention computes a weighted lookup across all previous tokens, pulling more from relevant ones and less from irrelevant ones.
The internal variables are literally named after database concepts: Q (Query), K (Key), V (Value). The Query is what you’re looking for. The Keys are the indexes. The Values are the data you retrieve.
Self-attention is a JOIN where a table queries itself.
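Scaled dot-product attention for a single query fits in a few lines. The vectors below are made-up two-dimensional toys (real models use hundreds of dimensions per head), but the Q/K/V mechanics are the real thing:

```python
import math

# One attention lookup: a query scored against three stored tokens.
# Q is what we're looking for; K are the indexes; V are the stored data.
keys   = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one row per previous token
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query  = [1.0, 0.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# score each key against the query, scaled by sqrt(dimension)
scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
weights = softmax(scores)                 # relevance of each previous token
output = [sum(w * v[i] for w, v in zip(weights, values))
          for i in range(len(values[0]))]

print([round(w, 2) for w in weights])     # rows 1 and 3 match the query best
print([round(x, 2) for x in output])      # weighted blend of their values
```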
Loss Is a Test Suite. Gradient Descent Is Git Bisect.
During training, the loss function measures “how wrong was the model’s prediction?” It’s your test suite. Every training step runs the model on a batch of examples, measures the loss (tests fail), and adjusts the weights to reduce the loss (fix the failing tests).
Gradient descent is the specific method. It calculates the direction each weight should move to reduce loss. Think of it as automated git bisect — you know something is wrong, you know the direction of the fix, you take a small step, check again, repeat.
The learning rate is your step size. Too big and you overshoot the fix. Too small and you converge so slowly you’ll die of old age waiting. The sweet spot is usually somewhere around 2e-5 (0.00002), which tells you how tiny these adjustments are.
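The loop itself is short enough to show whole. This sketch fits one knob to made-up data with mean squared error as the loss; the learning rate here is much larger than 2e-5 because the problem is tiny:

```python
# Gradient descent on a one-knob model: fit w so that f(x) = w * x
# matches the data y = 3 * x. Loss is mean squared error.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0          # the knob, starting wrong
lr = 0.01        # learning rate: the step size

for step in range(1000):
    # derivative of the loss with respect to w, averaged over the batch
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad            # step in the direction that reduces loss

print(round(w, 3))  # converges to 3.0
```

Crank `lr` up to 0.5 and the same loop diverges — that's the overshoot.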
Backpropagation Is the Chain Rule. That’s It.
If you remember one thing from calculus, the chain rule says: d/dx f(g(x)) = f'(g(x)) * g'(x).
A neural network is a composition of functions: layer3(layer2(layer1(x))). Backpropagation applies the chain rule backward through the network to figure out how much each weight contributed to the error.
It’s dependency tracking. Exactly like how your build system figures out which source file caused the compilation error.
The “back” in backpropagation means it starts at the output (where the error is measured) and works backward to the input (where the weights live), computing blame for each weight along the way.
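You can run the chain rule by hand on a two-function "network" to see the blame flow backward:

```python
# Backprop by hand on f(g(x)), with g(x) = 3x and f(u) = u ** 2.
def g(x): return 3 * x
def f(u): return u ** 2

x = 2.0
u = g(x)            # forward pass: u = 6.0
y = f(u)            # forward pass: y = 36.0

# backward pass: multiply local derivatives, output to input
dy_du = 2 * u       # local derivative of f at u
du_dx = 3           # local derivative of g
dy_dx = dy_du * du_dx

print(dy_dx)        # 36.0 — matches d/dx (3x)^2 = 18x at x = 2
```

A 40-layer network is the same multiplication, just 40 factors deep.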
Overfitting Is Memorization. Regularization Is Abstraction.
If you train a model too long on too little data, it memorizes the training examples instead of learning general patterns.
It’s like a junior developer who memorizes Stack Overflow answers instead of understanding the underlying concepts. Works perfectly on the examples they’ve seen. Falls apart on anything new.
Regularization techniques — dropout, weight decay, early stopping — force the model to learn generalizable patterns instead of memorizing. They’re the code review of machine learning. They say: “You can’t just hardcode the answer. You have to understand why.”
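Dropout is the easiest of the three to sketch. During training it randomly zeroes activations so no single unit can hardcode the answer; at inference it does nothing (the keep probability and inverted scaling below follow the standard formulation):

```python
import random

# Dropout in miniature: randomly zero activations during training so the
# network can't rely on any single unit memorizing the answer.
def dropout(activations, p, training=True):
    if not training:
        return activations            # inference: pass through unchanged
    out = []
    for a in activations:
        if random.random() < p:
            out.append(0.0)           # dropped
        else:
            out.append(a / (1 - p))   # inverted scaling keeps totals comparable
    return out

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False))
```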
LoRA Is Dependency Injection for Weights
Full fine-tuning updates all 3 billion parameters. That’s rewriting your entire codebase to add a feature.
LoRA (Low-Rank Adaptation) freezes the original weights and injects small adapter matrices alongside them. It’s dependency injection. You’re not modifying the base class. You’re injecting a new implementation through a narrow interface.
The r (rank) parameter controls how narrow that interface is. r=8 means the adapter is an 8-dimensional bottleneck — very constrained, very efficient. r=16 gives it more expressiveness. The base model is the framework. The LoRA adapter is your application code plugged into it.
This is why you can fine-tune a 3B model on a consumer GPU. You’re not updating 3 billion parameters. You’re updating a few million adapter parameters while the rest stay frozen.
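The arithmetic behind that claim is worth seeing. For one weight matrix, LoRA replaces a d x d update with two thin matrices B (d x r) and A (r x d); the hidden dimension below is an assumed, typical value for illustration:

```python
# LoRA parameter arithmetic for a single d x d weight matrix.
# Inference uses W + B @ A; only B and A are trained.
d = 4096        # hidden dimension (assumed for illustration)
r = 8           # LoRA rank: the width of the injected interface

full_params    = d * d           # what full fine-tuning updates
adapter_params = d * r + r * d   # what a LoRA adapter trains

print(full_params)                    # 16_777_216
print(adapter_params)                 # 65_536
print(full_params // adapter_params)  # 256x fewer trainable parameters
```

Multiply across every attention and MLP matrix in the model and you get the "few million adapter parameters" number.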
Quantization Is Compression
A model’s weights start life as 16- or 32-bit floats. Quantization compresses them to 8-bit, 4-bit, or even 2-bit representations. It’s JPEG for neural networks — you lose some precision, but the file shrinks 4-8x.
4-bit quantization takes a 3B model from ~6GB to ~2GB. The precision loss is surprisingly small because most weights don’t need 32 bits of precision. It’s the same insight as why JPEG works: most of the information is in the low-frequency components.
NF4 (Normal Float 4-bit) is the specific compression scheme optimized for how neural network weights are distributed — they tend to cluster around zero, so NF4 allocates more precision near zero and less at the extremes.
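Here's plain linear 4-bit quantization as a sketch — NF4 is smarter about spacing its levels near zero, but the compress-then-approximate round trip is the same idea (the weight values below are made up):

```python
# Uniform 4-bit quantization in miniature: map floats onto 16 integer
# levels, then reconstruct. NF4 spaces its levels non-uniformly instead.
def quantize(weights, bits=4):
    levels = 2 ** bits - 1                   # 15 steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    q = [round((w - lo) / scale) for w in weights]   # ints in 0..15
    return q, lo, scale

def dequantize(q, lo, scale):
    return [lo + n * scale for n in q]

weights = [-0.21, -0.05, 0.0, 0.03, 0.18]
q, lo, scale = quantize(weights)
restored = dequantize(q, lo, scale)

print(q)                                  # integers that fit in 4 bits
print([round(w, 3) for w in restored])    # close to, not equal to, the input
```

The reconstruction error is bounded by half a step of `scale` — that's the "lost precision" the JPEG analogy refers to.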
Temperature Is Risk Tolerance
When the model generates the next token, it computes a probability distribution over all possible next tokens. Temperature scales that distribution.
- temperature=0: Always picks the most probable token. Deterministic. Safe. Boring.
- temperature=0.7: Samples with some randomness. Creative. Varied. Occasionally surprising.
- temperature=1.0: Samples proportionally to the probabilities. Full creative range.
- temperature=2.0: Flattens the distribution. Chaotic. Random.
For code generation, you want 0.1 — almost deterministic, because there’s usually one correct answer. For creative writing, you’d use 0.7-1.0. It’s the same concept as risk tolerance in portfolio management.
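Temperature is literally one division before the softmax. A sketch with made-up logits for three candidate tokens:

```python
import math

# Temperature scaling: divide the raw scores (logits) by T before softmax.
# Low T sharpens the distribution; high T flattens it.
def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # model's raw preference for three tokens

for t in (0.1, 0.7, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# at t=0.1 nearly all the mass sits on the top token;
# at t=2.0 it spreads across all three
```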
GRPO Is A/B Testing With Natural Selection
GRPO (Group Relative Policy Optimization) is a reinforcement learning technique for fine-tuning. Here’s how it maps to what you know:
Generate N different responses to the same prompt — that’s A/B/C/D testing. Score each response with reward functions — your automated test suite. The responses that score highest get reinforced. The model’s weights shift toward producing outputs like the winners. The responses that score lowest get suppressed.
It’s natural selection applied to text generation.
The predecessor, PPO, needed a separate “critic” model to evaluate quality — like having a dedicated QA team grade every single build. GRPO eliminates the critic by using relative ranking within the group. The best of this batch is “good.” The worst is “bad.” No external judge needed. This halves the memory requirement, which matters when you’re training on consumer hardware.
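The group-relative scoring is the part that fits in a sketch. Given reward scores for N candidate responses to one prompt, each response's "advantage" is how far it sits from the group mean — no critic model involved (the rewards below are made-up binary pass/fail scores):

```python
# The group-relative core of GRPO: rank each response against its own
# batch by normalizing rewards to zero mean and unit variance.
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0            # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# pass/fail rewards from an automated checker, for 4 candidate responses
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))   # winners positive, losers negative
# training then nudges the weights toward the positive-advantage responses
```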
Why This Matters for What I’m Building
Most ML problems are hard because the output space is infinite and evaluation is subjective. “Write a good poem” has infinite valid outputs and no mechanical way to score them.
My problem is the opposite.
I built a grammar of architectural terms with validation rules that mechanically verify any specification. The output space is finite — a closed set of node types, connection labels, and patch operation types. The scoring is deterministic — a Python function checks all the rules in milliseconds and returns a binary pass/fail for each one.
So I fine-tuned a 3B model on that closed vocabulary with those validators as the reward signal. The model doesn’t need to know about cooking, history, or JavaScript trivia. It only needs to know 28 terms and 29 rules.
A 3B model that has forgotten everything except your domain should outperform a 70B model that knows everything, on that specific task. Because the general-purpose model is guessing. The fine-tuned model is constrained.
Constraints win.
The Cheat Sheet
| ML Concept | Your Mental Model |
|---|---|
| Neural network | Function with tunable knobs |
| Training | Compilation |
| Inference | Runtime |
| Parameters | Memory capacity |
| Layers | Middleware / call stack |
| Tokens | UTF-8 codepoints |
| Attention (Q, K, V) | Database JOIN on itself |
| Loss function | Test suite |
| Gradient descent | Git bisect |
| Backpropagation | Build system blame tracking |
| Learning rate | Step size |
| Overfitting | Memorizing Stack Overflow |
| Regularization | Code review |
| LoRA | Dependency injection |
| Quantization | JPEG compression |
| Temperature | Risk tolerance |
| GRPO | A/B testing + natural selection |
| Fine-tuning | Recompiling with a patch |
Where to Go From Here
If this clicked for you, the next step isn’t reading more theory. It’s picking a narrow, mechanically verifiable task and fine-tuning a small model to do it. The tooling has gotten remarkably accessible:
- MLX runs natively on Apple Silicon. If you have an M-series Mac, you can fine-tune a 3B model overnight.
- QLoRA with bitsandbytes lets you fine-tune on a consumer NVIDIA GPU with 8GB of VRAM.
- Ollama lets you deploy the result locally with a single command.
The barrier isn’t hardware or math anymore. It’s knowing which knobs to turn. Now you do.
Jason Walker is building Loop Lock, the perfect A/V loop creator, and designing the standard of system design grammar. He writes about specification-driven engineering, solo development, and making AI do exactly what you tell it to. Interested in the grammar? Email stonecassette@gmail.com with the subject “Interested in your system design grammar.” Follow the work at jsonwalker.com.