
Training a 450M Parameter LLM on a Single GPU: What I Learned

16 Apr 2026

I built sLLM Trainer to make local LLM training accessible. Here are the architecture decisions, the VRAM gotchas, and what BF16 vs FP16 actually means in practice.

Why I Built This

Every LLM training guide assumes you have access to a cluster. I wanted to train on a single consumer GPU — and document what actually works.

The Architecture Decisions

Pre-norm vs Post-norm

I went with pre-normalization (LayerNorm before attention, not after). This is what GPT-2 uses, and pre-norm is noticeably more stable during training — which matters on a single GPU, where a divergent run costs you days, not minutes.

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)  # pre-norm: LayerNorm before the sub-block
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual + pre-norm attention
        x = x + self.mlp(self.ln2(x))   # residual + pre-norm MLP
        return x

BF16 vs FP16: The Practical Difference

Format   Range    Precision   Best for
FP32     Wide     High        Reference
FP16     Narrow   Medium      Inference
BF16     Wide     Lower       Training

BF16 has the same 8-bit exponent — and therefore the same dynamic range — as FP32; it trades away mantissa bits instead. This matters because gradient updates during training can be very small: FP16 underflows them to zero, BF16 doesn't. Use BF16 for training, FP16 for inference.
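You can see the underflow without a GPU by round-tripping a tiny gradient through each format. The sketch below uses the `struct` module's native half-precision support for FP16 and simulates BF16 by truncating a float32's low mantissa bits (a simplification — real hardware rounds to nearest rather than truncating):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision ('e' format, Python 3.6+)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16: a float32 with the mantissa truncated to 7 bits,
    i.e. keep only the top 16 bits of the float32 pattern."""
    (bits,) = struct.unpack('I', struct.pack('f', x))
    return struct.unpack('f', struct.pack('I', bits & 0xFFFF0000))[0]

grad = 1e-8                # a tiny gradient update
print(to_fp16(grad))       # 0.0 -- below FP16's smallest subnormal (~6e-8)
print(to_bf16(grad))       # nonzero, ~1e-8 -- BF16 keeps FP32's exponent range
```

Both formats represent ordinary magnitudes like weights (~1.0) fine; the gap only opens up at the extremes of the range, which is exactly where gradients live.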

The VRAM Gotcha Nobody Mentions

Gradient checkpointing saves ~40% VRAM but adds ~20% compute time. Worth it. Always enable it if you're under 24GB VRAM.

model.gradient_checkpointing_enable()

At 450M params I was running at ~18GB with BF16 + gradient checkpointing. Without it: OOM at 16GB.
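For a rough sense of where that 18GB goes, here is a back-of-the-envelope estimator. The function name and the accounting are my assumptions — AdamW with FP32 moments plus FP32 master weights, the common mixed-precision setup — and activations (the part gradient checkpointing shrinks) are deliberately excluded:

```python
def estimate_training_vram_gb(n_params: float,
                              bytes_per_param: int = 2) -> float:
    """Rough VRAM floor for BF16 mixed-precision training with AdamW.

    Counts weights, gradients, and optimizer state only; activations,
    CUDA buffers, and fragmentation come on top of this.
    """
    weights = n_params * bytes_per_param   # BF16 copy used in forward/backward
    grads = n_params * bytes_per_param     # BF16 gradients
    optim_state = n_params * 4 * 2         # FP32 exp_avg + exp_avg_sq (AdamW)
    master = n_params * 4                  # FP32 master weights
    return (weights + grads + optim_state + master) / 1024**3

print(f"{estimate_training_vram_gb(450e6):.1f} GB")  # ~6.7 GB
```

That leaves roughly 11GB of my observed ~18GB for activations and overhead — which is why checkpointing the activations is the difference between fitting and OOM on a 16GB card.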