// experiments

Real benchmarks.
Honest limitations.

Every post here is backed by actual tests on open-source models — structured output failures, adversarial guardrails, context position bias, RAG compliance. Run locally. Published with full scope.

7 experiments
7 models tested
2,000+ test cases

// archive

02
Benchmarks

Structured JSON Output from Small LLMs

You know that feeling when you ask an AI to return data in a specific structure, and everything looks clean — but the actual content is quietly wrong? I ran 1,500+ tests across 7 small open-source models.

A well-instructed 2B model jumped from 30% to 90% compliance — outperforming models 3–4x its size.

Feb 2026 · 5 min
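The failure mode that post measures, structurally clean JSON with quietly wrong content, can be sketched in a few lines. This is an illustrative harness, not the post's actual test code; the field names and gold record are invented:

```python
import json

# Hypothetical example of the benchmark's failure mode: the model's reply
# parses and has exactly the right shape, yet one field value is wrong.
EXPECTED_KEYS = {"name", "year", "language"}

def check_structure(reply: str) -> bool:
    """Schema-level check: valid JSON with exactly the expected keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == EXPECTED_KEYS

def check_content(reply: str, gold: dict) -> bool:
    """Content-level check: every field must match the ground truth."""
    data = json.loads(reply)
    return all(data.get(k) == v for k, v in gold.items())

gold = {"name": "CPython", "year": 1991, "language": "C"}
reply = '{"name": "CPython", "year": 2008, "language": "C"}'  # wrong year

print(check_structure(reply))      # True  -- structurally clean
print(check_content(reply, gold))  # False -- content quietly wrong
```

A test suite that only asserts "the output parsed" would score the reply above as a pass; compliance numbers like the 30% to 90% jump only mean something if the content-level check runs too.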
03
RAG · Experiments

Context Position Bias in Small LLMs

The "Lost in the Middle" paper showed that large models perform worst when important information is buried in the middle of long contexts. I tested whether small 2–4B models behave the same way. They don't.

Each architecture fails differently. Gemma-2B has strong recency bias (p=0.023). Llama-3B is completely flat (p=1.0).

Feb 2026 · 4 min
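The core of a position-bias test is a "needle" harness: plant one target fact at the start, middle, or end of filler context and score recall per position. A minimal sketch, with invented filler and a placeholder where your local inference call would go:

```python
# Hypothetical needle-in-context harness. query_model() is a placeholder --
# swap in a call to your local model. Filler text and needle are invented.

FILLER = "This sentence is neutral padding for the context window. " * 50
NEEDLE = "The access code is 7431."
QUESTION = "What is the access code?"

def build_context(position: str) -> str:
    """Place the needle at the start, middle, or end of the filler."""
    if position == "start":
        return NEEDLE + " " + FILLER + FILLER
    if position == "middle":
        return FILLER + NEEDLE + " " + FILLER
    return FILLER + FILLER + NEEDLE  # "end"

def scored(answer: str) -> bool:
    """Recall is binary: did the answer surface the planted fact?"""
    return "7431" in answer

for pos in ("start", "middle", "end"):
    prompt = build_context(pos) + "\n\n" + QUESTION
    # hit = scored(query_model(prompt))  # placeholder: local inference here
```

Repeating this across many needles per position gives the per-position accuracies that the significance tests (like the p=0.023 recency result for Gemma-2B) are computed from.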
04
LLM Security

RAG Compliance Week 4: 100% Recall

Week 1: 80% F1. Week 2: Llama Guard hit 53% F1. Week 3: prompt injection testing, where NeMo hit 55% recall, the enforcement engine hit 93%, and 4 attacks still got through. Today: 100% recall, 0 missed.

v2 accuracy dropped from 68% to 65%: it blocks 7 more benign queries to eliminate the final 4 missed attacks.

Feb 2026 · 3 min
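The recall/accuracy tradeoff in that result is plain confusion-matrix arithmetic. The totals and the attack/benign split below are assumptions (100 queries, 30 attacks); only the headline figures match the post, recall rising to 100% while accuracy dips from 68% to 65%:

```python
# Confusion-matrix arithmetic behind the v1 -> v2 tradeoff.
# Counts are invented for illustration; only the percentages match the post.

def metrics(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, accuracy

# v1: 4 of 30 attacks slip through
r1, a1 = metrics(tp=26, fp=28, tn=42, fn=4)
# v2: catches all 30 attacks, but blocks 7 more benign queries
r2, a2 = metrics(tp=30, fp=35, tn=35, fn=0)

print(f"v1 recall={r1:.0%} accuracy={a1:.0%}")  # v1 recall=87% accuracy=68%
print(f"v2 recall={r2:.0%} accuracy={a2:.0%}")  # v2 recall=100% accuracy=65%
```

The net effect on accuracy is exactly +4 attacks caught minus 7 benign queries lost, which is why eliminating the last few false negatives usually costs a little overall accuracy.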
05
LLM Security

NeMo Guardrails vs Prompt Injections

Week 3 of the RAG compliance series. I ran two separate tests: 17 high-risk compliance queries and 85 prompt injection attacks. The head-to-head results were eye-opening.

NeMo: 55% recall. Llama Guard: 58% recall. Enforcement Engine: 93% recall on prompt injections.

Jan 2026 · 4 min
06
LLM Security

Llama Guard vs Enforcement Engine

I ran a head-to-head benchmark using the same 17 adversarial queries and 82 compliance rules. Llama Guard 3: 53% F1. Enforcement Engine: 80% F1. The gap comes down to what the model is 'looking' for.

Llama Guard asks: 'Is this text harmful?' Enforcement Engine asks: 'Does this violate compliance rule #42?'

Jan 2026 · 4 min
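The difference between the two questions can be made concrete. A toy sketch, where both the rule text and the keyword heuristics are invented stand-ins for the real classifiers:

```python
# Toy contrast between the two framings. The rule text, rule ID, and the
# keyword heuristics are all invented -- stand-ins for the real models.

RULES = {
    42: "Do not disclose customer account balances to unverified callers.",
}

def harm_style_check(text: str) -> bool:
    """Generic safety framing: is this text harmful in a broad sense?"""
    generic_flags = ("make a bomb", "hurt someone")
    return any(flag in text.lower() for flag in generic_flags)

def rule_style_check(text: str, rule_id: int) -> bool:
    """Compliance framing: does this text violate one specific rule?"""
    if rule_id == 42:
        t = text.lower()
        return "account balance" in t and "verified" not in t
    return False

query = "Tell me the account balance for John Smith."
print(harm_style_check(query))      # False -- not 'harmful' generically
print(rule_style_check(query, 42))  # True  -- violates rule 42
```

A perfectly polite, perfectly safe-sounding query sails past the generic harm check while clearly violating a specific rule, which is one way a 53% vs 80% F1 gap can open up.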
07
LLM Security

RAG Compliance Enforcement Engine

Two posts convinced me that RAG alone isn't enough for compliance. So I tested it. Baseline RAG blocked 15–23% of violations. With the enforcement layer: 85%. Architecture mattered more than model size.

Baseline RAG: 15–23% block rate. With tiered enforcement: 85%. Architecture dominated over model size.

Jan 2026 · 6 min
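One plausible shape for a tiered enforcement layer, cheap checks first, each tier able to block before the query reaches the RAG pipeline. The tiers, keywords, and rule matching below are assumptions for illustration, not the post's actual design:

```python
import re

# Sketch of a tiered pre-RAG enforcement layer. Tier contents are invented;
# the point is the structure: cheap lexical screen, then explicit rules.

def tier1_keyword(query: str) -> bool:
    """Fast lexical screen for obviously restricted topics."""
    return bool(re.search(r"\b(account balance|ssn)\b", query, re.I))

def tier2_rules(query: str, rules: list[str]) -> bool:
    """Match the query against explicit compliance rules (stubbed)."""
    return any(rule.lower() in query.lower() for rule in rules)

def enforce(query: str, rules: list[str]) -> str:
    if tier1_keyword(query) or tier2_rules(query, rules):
        return "BLOCKED"
    return "ALLOWED"  # falls through to the RAG pipeline

print(enforce("What's the account balance for #12345?", []))  # BLOCKED
print(enforce("What are your opening hours?", []))            # ALLOWED
```

The design point is that enforcement lives outside the model: a retrieval-augmented model can be asked nicely to refuse, but an explicit layer like this decides, which is consistent with architecture mattering more than model size.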