// experiments

Real benchmarks.
Honest limitations.

Every post here is backed by actual tests on open-source models — structured output failures, adversarial guardrails, context position bias, RAG compliance. Run locally. Published with full scope.

7 experiments
7 models tested
2,000+ test cases

// archive

02
Benchmarks

Structured JSON Output from Small LLMs

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

A well-instructed 2B model jumped from 30% to 90% compliance — outperforming models 3–4x its size.

Feb 2026 · 5 min
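The failure mode that post measures, structurally clean JSON with quietly wrong content, can be sketched in a few lines. This is an illustrative harness, not the post's actual test code; the field names and gold record are invented:

```python
import json

# Hypothetical example of the benchmark's failure mode: the model's reply
# parses and has exactly the right shape, yet one field value is wrong.
EXPECTED_KEYS = {"name", "year", "language"}

def check_structure(reply: str) -> bool:
    """Schema-level check: valid JSON with exactly the expected keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == EXPECTED_KEYS

def check_content(reply: str, gold: dict) -> bool:
    """Content-level check: every field must match the ground truth."""
    data = json.loads(reply)
    return all(data.get(k) == v for k, v in gold.items())

gold = {"name": "CPython", "year": 1991, "language": "C"}
reply = '{"name": "CPython", "year": 2008, "language": "C"}'  # wrong year

print(check_structure(reply))      # True  -- structurally clean
print(check_content(reply, gold))  # False -- content quietly wrong
```

A test suite that only asserts "the output parsed" would score the reply above as a pass; compliance numbers like the 30% to 90% jump only mean something if the content-level check runs too.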
03
RAG · Experiments

Context Position Bias in Small LLMs

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Each architecture fails differently. Gemma-2B has strong recency bias (p=0.023). Llama-3B is completely flat (p=1.0).

Feb 2026 · 4 min
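The core of a position-bias test is a "needle" harness: plant one target fact at the start, middle, or end of filler context and score recall per position. A minimal sketch, with invented filler and a placeholder where your local inference call would go:

```python
# Hypothetical needle-in-context harness. query_model() is a placeholder --
# swap in a call to your local model. Filler text and needle are invented.

FILLER = "This sentence is neutral padding for the context window. " * 50
NEEDLE = "The access code is 7431."
QUESTION = "What is the access code?"

def build_context(position: str) -> str:
    """Place the needle at the start, middle, or end of the filler."""
    if position == "start":
        return NEEDLE + " " + FILLER + FILLER
    if position == "middle":
        return FILLER + NEEDLE + " " + FILLER
    return FILLER + FILLER + NEEDLE  # "end"

def scored(answer: str) -> bool:
    """Recall is binary: did the answer surface the planted fact?"""
    return "7431" in answer

for pos in ("start", "middle", "end"):
    prompt = build_context(pos) + "\n\n" + QUESTION
    # hit = scored(query_model(prompt))  # placeholder: local inference here
```

Repeating this across many needles per position gives the per-position accuracies that the significance tests (like the p=0.023 recency result for Gemma-2B) are computed from.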
04
LLM Security

RAG Compliance Week 4: 100% Recall

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: prompt injection testing, where NeMo hit 55% recall, the enforcement engine hit 93%, and 4 attacks still got through. Today: 100% recall, 0 missed.

v2 accuracy dropped from 68% to 65%: it blocks 7 more benign queries to eliminate the final 4 missed attacks.

Feb 2026 · 3 min
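The recall/accuracy tradeoff in that result is plain confusion-matrix arithmetic. The totals and the attack/benign split below are assumptions (100 queries, 30 attacks); only the headline figures match the post, recall rising to 100% while accuracy dips from 68% to 65%:

```python
# Confusion-matrix arithmetic behind the v1 -> v2 tradeoff.
# Counts are invented for illustration; only the percentages match the post.

def metrics(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, accuracy

# v1: 4 of 30 attacks slip through
r1, a1 = metrics(tp=26, fp=28, tn=42, fn=4)
# v2: catches all 30 attacks, but blocks 7 more benign queries
r2, a2 = metrics(tp=30, fp=35, tn=35, fn=0)

print(f"v1 recall={r1:.0%} accuracy={a1:.0%}")  # v1 recall=87% accuracy=68%
print(f"v2 recall={r2:.0%} accuracy={a2:.0%}")  # v2 recall=100% accuracy=65%
```

The net effect on accuracy is exactly +4 attacks caught minus 7 benign queries lost, which is why eliminating the last few false negatives usually costs a little overall accuracy.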
05
LLM Security

NeMo Guardrails vs Prompt Injections

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

NeMo: 55% recall. Llama Guard: 58% recall. Enforcement Engine: 93% recall on prompt injections.

Jan 2026 · 4 min
06
LLM Security

Llama Guard vs Enforcement Engine

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Llama Guard asks: 'Is this text harmful?' Enforcement Engine asks: 'Does this violate compliance rule #42?'

Jan 2026 · 4 min
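The difference between the two questions can be made concrete. A toy sketch, where both the rule text and the keyword heuristics are invented stand-ins for the real classifiers:

```python
# Toy contrast between the two framings. The rule text, rule ID, and the
# keyword heuristics are all invented -- stand-ins for the real models.

RULES = {
    42: "Do not disclose customer account balances to unverified callers.",
}

def harm_style_check(text: str) -> bool:
    """Generic safety framing: is this text harmful in a broad sense?"""
    generic_flags = ("make a bomb", "hurt someone")
    return any(flag in text.lower() for flag in generic_flags)

def rule_style_check(text: str, rule_id: int) -> bool:
    """Compliance framing: does this text violate one specific rule?"""
    if rule_id == 42:
        t = text.lower()
        return "account balance" in t and "verified" not in t
    return False

query = "Tell me the account balance for John Smith."
print(harm_style_check(query))      # False -- not 'harmful' generically
print(rule_style_check(query, 42))  # True  -- violates rule 42
```

A perfectly polite, perfectly safe-sounding query sails past the generic harm check while clearly violating a specific rule, which is one way a 53% vs 80% F1 gap can open up.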
07
LLM Security

RAG Compliance Enforcement Engine

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Baseline RAG: 15–23% block rate. With tiered enforcement: 85%. Architecture dominated over model size.

Jan 2026 · 6 min
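One plausible shape for a tiered enforcement layer, cheap checks first, each tier able to block before the query reaches the RAG pipeline. The tiers, keywords, and rule matching below are assumptions for illustration, not the post's actual design:

```python
import re

# Sketch of a tiered pre-RAG enforcement layer. Tier contents are invented;
# the point is the structure: cheap lexical screen, then explicit rules.

def tier1_keyword(query: str) -> bool:
    """Fast lexical screen for obviously restricted topics."""
    return bool(re.search(r"\b(account balance|ssn)\b", query, re.I))

def tier2_rules(query: str, rules: list[str]) -> bool:
    """Match the query against explicit compliance rules (stubbed)."""
    return any(rule.lower() in query.lower() for rule in rules)

def enforce(query: str, rules: list[str]) -> str:
    if tier1_keyword(query) or tier2_rules(query, rules):
        return "BLOCKED"
    return "ALLOWED"  # falls through to the RAG pipeline

print(enforce("What's the account balance for #12345?", []))  # BLOCKED
print(enforce("What are your opening hours?", []))            # ALLOWED
```

The design point is that enforcement lives outside the model: a retrieval-augmented model can be asked nicely to refuse, but an explicit layer like this decides, which is consistent with architecture mattering more than model size.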