Context Position Bias in Small LLMs

Stanford researchers showed that GPT-3.5 loses information buried in the middle of its context. Do small open-source models behave the same way?

Scope & limitations — read first

3 models · 7 positions tested · ~500 trials per model · run locally on Apple Silicon · replication of Liu et al., 2023

The "Lost in the Middle" paper (Liu et al., 2023) showed that large models like GPT-3.5 perform worst when important information is buried in the middle of long contexts — a U-shaped accuracy curve.

Do small open-source models (2–4B parameters) behave the same way?

The Setup

Accuracy by document position — Gemma-2B recency bias, Gemma-4B middle dip, Llama-3B flat
  • Models: Gemma-2B, Gemma-4B, Llama-3B
  • Context: 70–100 documents per prompt with 7 hard distractors per question
  • Scale: 7 positions tested, ~500 trials per model
  • Hardware: Run locally on Apple Silicon
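Concretely, each trial places the gold document at a chosen position among shuffled distractors before prompting the model. A minimal sketch of such a harness — the function name, prompt template, and small document count are illustrative (the real runs used 70–100 documents per prompt), not the exact code behind these experiments:

```python
import random

def build_prompt(question, gold_doc, distractors, gold_position):
    """Insert the gold document at a fixed index among shuffled distractors.

    Hypothetical harness code: template and names are illustrative.
    """
    docs = list(distractors)       # copy so the caller's list is untouched
    random.shuffle(docs)           # randomize distractor order per trial
    docs.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt(
    "What year was the bridge completed?",
    "The bridge was completed in 1937.",
    [f"Filler passage number {i}." for i in range(7)],
    gold_position=3,  # 0-based: gold becomes the 4th document
)
```

Sweeping `gold_position` across the context while holding everything else fixed is what produces the per-position accuracy curves below.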

Results

Position bias heatmap across models and gold document positions
Early vs late position accuracy — statistical significance per model
Expected U-curve vs actual results for small models

Gemma-2B (Recency Bias)

Worst at the beginning (81.9%), best at the end (97.2%). Statistically significant preference for recent info (p=0.023).
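An early-vs-late comparison like this can be checked with a plain two-proportion z-test. A stdlib-only sketch; the counts below are illustrative (chosen to match the percentages above assuming 72 trials per position), and the writeup's actual test may differ, so this will not exactly reproduce the reported p-values:

```python
import math

def two_prop_p(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Illustrative counts: 59/72 correct early (~81.9%) vs 70/72 late (~97.2%)
p = two_prop_p(59, 72, 70, 72)
print(f"p = {p:.4f}")
```

Rerunning the same comparison at smaller n also shows why apparent "trends" at n=30 can fail to reach significance: the standard error shrinks only with more trials.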

Gemma-4B (Weak Middle Dip)

Similar upward trend, but its weakest point was actually position 50 (88.9%) — hinting at a mild middle dip. Not statistically significant (p=0.198).

Llama-3B (Flat/Stable)

Essentially flat: early- and late-position accuracy were identical, with no significant position effect (p=1.0).

The Lesson

Findings from GPT-scale papers don't always apply to 2–4B parameter models.
  • Run the stats: My first pass at n=30 showed "trends" that vanished or reversed at n=72.
  • Know your model: If you use Gemma-2B, document ordering is critical (put the best chunk last). If you use Llama-3B, it's far less sensitive.
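For a recency-biased model like Gemma-2B, the "put the best chunk last" advice amounts to sorting retrieved chunks ascending by relevance before building the prompt. A hypothetical helper (the name and `(chunk, score)` tuple shape are assumptions, not part of any RAG library):

```python
def order_for_recency(scored_chunks):
    """Sort (chunk, relevance) pairs ascending by score so the most
    relevant chunk lands last, where a recency-biased model attends best.
    """
    return [chunk for chunk, _ in sorted(scored_chunks, key=lambda pair: pair[1])]

retrieved = [("intro blurb", 0.31), ("exact answer", 0.92), ("related doc", 0.55)]
ordered = order_for_recency(retrieved)  # "exact answer" ends up last
```

For a position-insensitive model like Llama-3B here, this reordering should matter far less, and the default retrieval ranking can be kept.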

Open questions

Does position bias change with different chunking strategies?

How does this interact with retrieval ranking in production RAG?

Do fine-tuned variants of these models show different patterns?