Structured JSON Output from Small LLMs

1,500+ tests across 7 models. Forcing JSON Mode degraded 2 of 3 models. A 2B model beat a 7B on defaults.

Scope & limitations — read first

1,500+ tests · 7 models (2B to 9B) · open-source only · run locally on Apple Silicon

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong?

I ran 1,500+ tests across 7 small open-source models (2B to 9B parameters). These are the models teams actually self-host to save costs and keep data private. Here's what I found:

4-panel comparison dashboard — structured output reliability across 7 models

1. Forcing a strict format can backfire

Most people assume that forcing a strict format (like "JSON Mode") improves accuracy. It often does the opposite. Two out of three models got worse when I forced strict formatting. One 9B model dropped from perfect accuracy to 92%. The output looked prettier, but the reasoning behind it was less reliable.
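The distinction that matters here is between format validity and content accuracy: JSON Mode guarantees the first, not the second. A minimal scoring sketch (the function name and scoring scheme are my own, not from the study) makes the two axes explicit:

```python
import json

def score_output(raw: str, expected: dict) -> dict:
    """Score one model response on two independent axes:
    format validity (does it parse?) and content accuracy
    (do the top-level values match the ground truth?)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "accurate": False}
    accurate = all(parsed.get(k) == v for k, v in expected.items())
    return {"valid_json": True, "accurate": accurate}

# Perfectly formatted, quietly wrong — the failure JSON Mode hides:
pretty_but_wrong = '{"name": "Alice", "age": 99}'
print(score_output(pretty_but_wrong, {"name": "Alice", "age": 30}))
# {'valid_json': True, 'accurate': False}
```

Tracking both numbers separately is what surfaces the regression: a model can climb on the first axis while sliding on the second.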

2. A 2B model beat a model 4x its size

A tiny 2B model scored 90% accuracy with the right guidance. A 7B model only managed 52% on its own.

At this scale, architecture and instructions matter more than raw parameter count. As models get larger, this gap narrows — but even 20B+ models still benefit from structured guidance on complex tasks.

Compliance vs model size — KB rules close the gap between 2B and 9B models
Gemma-2B highlight — 30% → 90% jump with 8 KB rules

3. Every small model family fails differently

One family kept getting data types wrong, while another invented entirely new fields. The baseline 2B model hallucinated fields 70% of the time, while the 7B model hit 38%. If you run small models for cost or privacy, you must know which specific mistakes your model family tends to make.
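Knowing your model family's signature mistakes means you can check for them mechanically. A sketch of such a classifier (the helper name and error labels are my own, assuming a flat schema of field names to Python types):

```python
import json

def classify_failures(raw: str, schema: dict) -> list[str]:
    """Classify the two failure modes described above:
    wrong data types vs hallucinated (invented) fields.
    `schema` maps each expected field name to its Python type."""
    failures = []
    parsed = json.loads(raw)
    for field, value in parsed.items():
        if field not in schema:
            failures.append(f"hallucinated field: {field}")
        elif not isinstance(value, schema[field]):
            failures.append(f"wrong type: {field}")
    for field in schema:
        if field not in parsed:
            failures.append(f"missing field: {field}")
    return failures

schema = {"name": str, "age": int}
print(classify_failures('{"name": "Bo", "age": "42", "mood": "ok"}', schema))
# ['wrong type: age', 'hallucinated field: mood']
```

Logging which bucket each failure lands in, per model family, is how the 70% vs 38% hallucination numbers above become actionable.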

4. The "Complexity Cliff"

On basic requests, small models scored 100%. But on edge cases — the kind you actually encounter in production — accuracy dropped to 67%. This is where small models need the most help.

5. Only 1 rule actually held it all together

Ablation study — removing each rule one by one to find the critical one
Summary table of all 1,500+ tests across 7 models

I tested 8 guidance rules by removing them one by one. Seven made no difference. One specific rule — telling the model not to confuse data structure definitions with actual data — was the only thing holding accuracy together. Without it, compliance dropped immediately.
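The failure this rule prevents is a model echoing the schema definition back instead of filling it in. A hedged sketch of both sides — a plausible wording of the rule (the exact phrasing used in the study may differ) and a heuristic detector for the schema-echo failure:

```python
import json

# Hypothetical wording of the critical rule; the study's actual
# prompt text may be phrased differently.
CRITICAL_RULE = (
    "The JSON schema below describes the SHAPE of the answer. "
    "Do not copy the schema itself; fill in real values."
)

# Keywords that appear in JSON Schema definitions but rarely in data.
SCHEMA_KEYWORDS = {"type", "properties", "required", "items"}

def looks_like_schema_echo(raw: str) -> bool:
    """Heuristic: did the model return the schema definition
    instead of populated data?"""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and bool(SCHEMA_KEYWORDS & parsed.keys())

echo = '{"type": "object", "properties": {"name": {"type": "string"}}}'
data = '{"name": "Alice"}'
print(looks_like_schema_echo(echo), looks_like_schema_echo(data))
# True False
```

A check like this, run over the test outputs, is one way to measure how often compliance collapses when the rule is removed.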

You don't always need the biggest model. A well-instructed 2B model jumped from 30% to 90% compliance with the right guidance — outperforming models 3–4x its size running on defaults.

Open questions

Does this pattern hold for larger models (20B+)?

What other single instructions have outsized impact on compliance?

How do these results compare to commercial API models?