I built sLLM Trainer to make local LLM training accessible. Here are the architecture decisions, the VRAM gotchas, and what BF16 vs FP16 actually means in practice.
Why I Built This
Every LLM training guide assumes you have access to a cluster. I wanted to train on a single consumer GPU — and document what actually works.
The Architecture Decisions
Pre-norm vs Post-norm
I went with pre-normalization (LayerNorm before attention, not after). This is what GPT-2 uses, and it's more stable during training on smaller hardware.
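To make the layout concrete, here's a minimal PyTorch sketch of a pre-norm transformer block. The class and names are illustrative stand-ins, not sLLM Trainer's actual API; the point is simply that the LayerNorm sits before each sub-layer while the residual skips around it.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: LayerNorm is applied *before* attention and the MLP,
    and the residual connection bypasses the normalized sub-layer."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.ln1(x)                                   # normalize first...
        x = x + self.attn(h, h, h, need_weights=False)[0]  # ...then add residual
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 8, 32)          # (batch, seq, dim)
out = PreNormBlock(32, 4)(x)       # shape is preserved
```

In post-norm (the original Transformer layout) the LayerNorm sits after the residual addition instead, which tends to need more careful warmup to train stably.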
BF16 vs FP16: The Practical Difference
| Format | Range | Precision | Best for |
|---|---|---|---|
| FP32 | Wide | High | Reference |
| FP16 | Narrow | Medium | Inference |
| BF16 | Wide | Lower | Training |
BF16 has the same exponent range as FP32 (8 exponent bits); it spends fewer bits on the mantissa instead. This matters because gradient updates during training can be tiny: FP16 underflows them to zero, while BF16 still represents them. Use BF16 for training, FP16 for inference.
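You can see the underflow with nothing but the standard library: `struct`'s `'e'` format round-trips a value through IEEE half precision, and truncating a float32 to its top 16 bits simulates bfloat16 (a simplification; real hardware rounds to nearest rather than truncating).

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (5 exponent bits)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping the top 16 bits of a float32
    (8 exponent bits, so the same range as float32)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

grad = 1e-8                # a plausibly tiny gradient update
print(to_fp16(grad))       # 0.0 — underflows in FP16
print(to_bf16(grad))       # a small nonzero value near 1e-8, survives in BF16
```

The smallest FP16 subnormal is about 6e-8, so a 1e-8 gradient simply vanishes; in BF16 it loses some mantissa precision but keeps its magnitude.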
The VRAM Gotcha Nobody Mentions
Gradient checkpointing drops intermediate activations during the forward pass and recomputes them during backward, saving ~40% VRAM for ~20% extra compute time. Worth it. Always enable it if you're under 24GB of VRAM.
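To make the tradeoff concrete, here's a hedged sketch of wrapping transformer blocks with `torch.utils.checkpoint`. `Block` and `Model` are illustrative stand-ins, not sLLM Trainer's actual classes; the checkpoint call itself is the standard PyTorch API.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Stand-in for a transformer block (feed-forward with residual)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, dim: int = 64, n_layers: int = 4, use_ckpt: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.use_ckpt = use_ckpt

    def forward(self, x):
        for blk in self.blocks:
            if self.use_ckpt and self.training:
                # Activations inside blk are NOT stored; they are recomputed
                # during backward — extra compute, much less VRAM.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x

model = Model().train()
y = model(torch.randn(2, 16, 64, requires_grad=True))
y.sum().backward()   # gradients flow through the checkpointed blocks
```

The memory saved grows with sequence length and layer count, which is why the percentage quoted above is a rough figure: your savings depend on how activation-heavy your configuration is.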
At 450M params I was running at ~18GB with BF16 + gradient checkpointing. Without it: OOM at 16GB.