
Training a 450M Parameter LLM on a Single GPU: What I Learned

16 Apr 2026

I built sLLM Trainer to make local LLM training accessible. Here are the architecture decisions, the VRAM gotchas, and what BF16 vs FP16 actually means in practice.

Why I Built This

Every LLM training guide assumes you have access to a cluster. I wanted to train on a single consumer GPU — and document what actually works.

The Architecture Decisions

Pre-norm vs Post-norm

I went with pre-normalization (LayerNorm before attention, not after). This is what GPT-2 uses, and pre-norm is noticeably more stable during training — which matters on a single GPU, where a divergent run costs you days, not minutes.

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)  # pre-norm: LayerNorm before the sub-block
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual + pre-norm attention
        x = x + self.mlp(self.ln2(x))   # residual + pre-norm MLP
        return x

BF16 vs FP16: The Practical Difference

Format   Range    Precision   Best for
FP32     Wide     High        Reference
FP16     Narrow   Medium      Inference
BF16     Wide     Lower       Training

BF16 has the same 8-bit exponent — and therefore the same dynamic range — as FP32; it trades away mantissa bits instead. This matters because gradient updates during training can be very small: FP16 underflows them to zero, BF16 doesn't. Use BF16 for training, FP16 for inference.
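You can see the underflow without a GPU by round-tripping a tiny gradient through each format. The sketch below uses the `struct` module's native half-precision support for FP16 and simulates BF16 by truncating a float32's low mantissa bits (a simplification — real hardware rounds to nearest rather than truncating):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision ('e' format, Python 3.6+)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_bf16(x: float) -> float:
    """Simulate bfloat16: a float32 with the mantissa truncated to 7 bits,
    i.e. keep only the top 16 bits of the float32 pattern."""
    (bits,) = struct.unpack('I', struct.pack('f', x))
    return struct.unpack('f', struct.pack('I', bits & 0xFFFF0000))[0]

grad = 1e-8                # a tiny gradient update
print(to_fp16(grad))       # 0.0 -- below FP16's smallest subnormal (~6e-8)
print(to_bf16(grad))       # nonzero, ~1e-8 -- BF16 keeps FP32's exponent range
```

Both formats represent ordinary magnitudes like weights (~1.0) fine; the gap only opens up at the extremes of the range, which is exactly where gradients live.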

The VRAM Gotcha Nobody Mentions

Gradient checkpointing saves ~40% VRAM but adds ~20% compute time. Worth it. Always enable it if you're under 24GB VRAM.

model.gradient_checkpointing_enable()

At 450M params I was running at ~18GB with BF16 + gradient checkpointing. Without it: OOM at 16GB.
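For a rough sense of where that 18GB goes, here is a back-of-the-envelope estimator. The function name and the accounting are my assumptions — AdamW with FP32 moments plus FP32 master weights, the common mixed-precision setup — and activations (the part gradient checkpointing shrinks) are deliberately excluded:

```python
def estimate_training_vram_gb(n_params: float,
                              bytes_per_param: int = 2) -> float:
    """Rough VRAM floor for BF16 mixed-precision training with AdamW.

    Counts weights, gradients, and optimizer state only; activations,
    CUDA buffers, and fragmentation come on top of this.
    """
    weights = n_params * bytes_per_param   # BF16 copy used in forward/backward
    grads = n_params * bytes_per_param     # BF16 gradients
    optim_state = n_params * 4 * 2         # FP32 exp_avg + exp_avg_sq (AdamW)
    master = n_params * 4                  # FP32 master weights
    return (weights + grads + optim_state + master) / 1024**3

print(f"{estimate_training_vram_gb(450e6):.1f} GB")  # ~6.7 GB
```

That leaves roughly 11GB of my observed ~18GB for activations and overhead — which is why checkpointing the activations is the difference between fitting and OOM on a 16GB card.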