Small models, sharp lessons

Everyone fine-tunes. Almost nobody trains from scratch anymore, and for good economic reasons. But spending a few weeks training a 35M-parameter model on local hardware taught me more about how these systems actually work than a year of calling inference APIs.

The abstraction tax

When you fine-tune through an API, the vendor has made every interesting decision for you: tokenizer, architecture, data mixture, optimization schedule. You're adjusting the last few percent of a system someone else designed. That's efficient, but it means your mental model of the system is borrowed, not built.

Training from scratch — even at toy scale — forces you to pay down that debt. You learn that the tokenizer isn't preprocessing, it's architecture: a vocabulary tuned to your corpus effectively lengthens your context window and reallocates parameters from memorizing subword junk to modeling structure. You learn that loss curves lie unless your data pipeline is deterministic. You learn what a bad learning-rate schedule feels like three hours into a run.

Constraints are the curriculum

At 35M parameters, nothing is free. Every architectural indulgence shows up immediately as a worse model, because there's no capacity slack to absorb your mistakes. This is the opposite of working at scale, where over-parameterization forgives almost everything.

Three things dominated outcomes, in order:

Data quality. Deduplication and filtering beat every architecture tweak I tried. It wasn't close.
Tokenizer fit. A domain-tuned BPE vocabulary outperformed a general-purpose one of the same size, consistently.
Instrumentation. The single highest-leverage component was logging that made every regression attributable to a specific change. Optimization without attribution is gambling.

None of these lessons are novel — they're in the literature. But there's a difference between having read something and having paid for it in GPU-hours.

Why this matters for engineers

Most of my work is systems engineering, not ML research. The reason this exercise was worth it is that AI-adjacent systems are becoming normal backend infrastructure, and the engineers who understand what's under the API make better decisions about latency budgets, failure modes, evaluation, and cost. You don't need to train frontier models. You need to have trained something, end to end, small enough that every failure was unmistakably yours.