Context Position Bias in Small LLMs

Stanford researchers showed that GPT-3.5 loses information buried in the middle of its context. Do small open-source models behave the same way?

Scope & limitations — read first

3 models · 7 positions tested · ~500 trials per model · run locally on Apple Silicon · replication of Liu et al., 2023

The "Lost in the Middle" paper (Liu et al., 2023) showed that large models like GPT-3.5 perform worst when important information is buried in the middle of long contexts — a U-shaped accuracy curve.

Do small open-source models (2–4B parameters) behave the same way?

The Setup

Accuracy by document position — Gemma-2B recency bias, Gemma-4B middle dip, Llama-3B flat
  • Models: Gemma-2B, Gemma-4B, Llama-3B
  • Context: 70–100 documents per prompt with 7 hard distractors per question
  • Scale: 7 positions tested, ~500 trials per model
  • Hardware: Run locally on Apple Silicon
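Concretely, each trial places the gold document at a chosen position among shuffled distractors before prompting the model. A minimal sketch of such a harness — the function name, prompt template, and small document count are illustrative (the real runs used 70–100 documents per prompt), not the exact code behind these experiments:

```python
import random

def build_prompt(question, gold_doc, distractors, gold_position):
    """Insert the gold document at a fixed index among shuffled distractors.

    Hypothetical harness code: template and names are illustrative.
    """
    docs = list(distractors)       # copy so the caller's list is untouched
    random.shuffle(docs)           # randomize distractor order per trial
    docs.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "What year was the bridge completed?",
    "The bridge was completed in 1937.",
    [f"Filler passage number {i}." for i in range(7)],
    gold_position=3,  # 0-based: gold becomes the 4th document
)
```

Sweeping `gold_position` across the context while holding everything else fixed is what produces the per-position accuracy curves below.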

Results

Position bias heatmap across models and gold document positions
Early vs late position accuracy — statistical significance per model
Expected U-curve vs actual results for small models

Gemma-2B (Recency Bias)

Worst at the beginning (81.9%), best at the end (97.2%). Statistically significant preference for recent info (p=0.023).
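An early-vs-late comparison like this can be checked with a plain two-proportion z-test. A stdlib-only sketch; the counts below are illustrative (chosen to match the percentages above assuming 72 trials per position), and the writeup's actual test may differ, so this will not exactly reproduce the reported p-values:

```python
import math

def two_prop_p(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Illustrative counts: 59/72 correct early (~81.9%) vs 70/72 late (~97.2%)
p = two_prop_p(59, 72, 70, 72)
print(f"p = {p:.4f}")
```

Rerunning the same comparison at smaller n also shows why apparent "trends" at n=30 can fail to reach significance: the standard error shrinks only with more trials.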

Gemma-4B (Weak Middle Dip)

Similar upward trend, but its weakest point was actually position 50 (88.9%) — hinting at a mild middle dip. Not statistically significant (p=0.198).

Llama-3B (Flat/Stable)

Essentially flat: early- and late-position accuracy were identical, with no significant position effect (p=1.0).

The Lesson

Findings from GPT-scale papers don't always apply to 2–4B parameter models.
  • Run the stats: My first pass at n=30 showed "trends" that vanished or reversed at n=72.
  • Know your model: If you use Gemma-2B, document ordering is critical (put the best chunk last). If you use Llama-3B, it's far less sensitive.
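For a recency-biased model like Gemma-2B, the "put the best chunk last" advice amounts to sorting retrieved chunks ascending by relevance before building the prompt. A hypothetical helper (the name and `(chunk, score)` tuple shape are assumptions, not part of any RAG library):

```python
def order_for_recency(scored_chunks):
    """Sort (chunk, relevance) pairs ascending by score so the most
    relevant chunk lands last, where a recency-biased model attends best.
    """
    return [chunk for chunk, _ in sorted(scored_chunks, key=lambda pair: pair[1])]

retrieved = [("intro blurb", 0.31), ("exact answer", 0.92), ("related doc", 0.55)]
ordered = order_for_recency(retrieved)  # "exact answer" ends up last
```

For a position-insensitive model like Llama-3B here, this reordering should matter far less, and the default retrieval ranking can be kept.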

Open questions

Does position bias change with different chunking strategies?

How does this interact with retrieval ranking in production RAG?

Do fine-tuned variants of these models show different patterns?