// projects

Built & tested.

Every project here backs a blog post with real data. Enforcement engines, voice AI, interview prep, and model benchmarks. Source code available on request.

01 · LLM Policy Enforcement Engine

Enforcement Engine

A 4-tier verification pipeline that enforces domain-specific KB rules on LLM outputs and blocks prompt injection attacks with 100% recall across 185 attack test cases.

100% · Attack recall
490 · Test cases
82 · KB rules

Results & architecture

Key features

Tier 0: XGBoost injection detection on CPU (~0.5ms)

Tier 1: Go sentinel for regex + obfuscation detection

Tier 2: Semantic routing with activation steering

Tier 3: LLM generation with NLI verification

42-45% better recall than Llama Guard 3 and NeMo Guardrails

Zero false positives on benign queries
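The tiers are ordered by cost: cheap checks run first and block early, so the expensive LLM + NLI stage only sees traffic that survived the fast filters. A minimal sketch of that ordering, with trivial stand-in predicates in place of the real XGBoost classifier, Go sentinel, semantic router, and NLI verifier (all function names and trigger strings here are illustrative, not the production rules):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    tier: int      # which tier produced the decision
    reason: str

def tier0_fast_classifier(prompt: str) -> bool:
    # Stand-in for the ~0.5 ms CPU injection classifier (XGBoost in the real system).
    return "ignore previous instructions" in prompt.lower()

def tier1_pattern_sentinel(prompt: str) -> bool:
    # Stand-in for the Go regex + obfuscation sentinel.
    return any(tok in prompt.lower() for tok in ("base64:", "system prompt"))

def enforce(prompt: str) -> Verdict:
    """Run tiers in order of cost; block at the first tier that fires."""
    if tier0_fast_classifier(prompt):
        return Verdict(False, 0, "injection classifier")
    if tier1_pattern_sentinel(prompt):
        return Verdict(False, 1, "pattern sentinel")
    # Tiers 2-3 (semantic routing, then generation with NLI verification)
    # would run here before the output is released.
    return Verdict(True, 3, "passed all tiers")
```

The early-exit structure is what keeps median latency low: benign queries pay only the sub-millisecond tiers.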

Tech stack

Python · FastAPI · Go · XGBoost · DeBERTa NLI · gRPC · Sentence Transformers
Private repo · Request access →
02 · AI Legal Analysis with Real-Time Voice

Contract Paranoia

Real-time voice-based contract analysis using Google Gemini Live API and Agent Development Kit. Users talk to "Para," an AI legal buddy that flags risky clauses with search-grounded citations.

~1.6s · Latency
3 · Risk levels
Full-duplex · Voice

Results & architecture

Key features

Bidirectional voice with interruption support

Multi-agent: root agent + analyzer sub-agent

RED / YELLOW / GREEN clause risk flagging

Google Search grounding prevents hallucinated legal advice

Judge Agent audits quality with 8-point rubric

Session persistence with conversation recovery on drops
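The RED / YELLOW / GREEN flagging boils down to mapping each clause to a risk level plus supporting citations. A minimal sketch of that output shape, using keyword heuristics as a stand-in for the analyzer sub-agent (the term lists and field names here are hypothetical, not the production taxonomy):

```python
from dataclasses import dataclass, field

# Illustrative trigger terms only; the real analyzer is an LLM sub-agent.
RED_TERMS = ("indemnify", "unlimited liability", "waive")
YELLOW_TERMS = ("auto-renew", "arbitration", "exclusivity")

@dataclass
class ClauseFlag:
    clause: str
    risk: str                                   # "RED" | "YELLOW" | "GREEN"
    citations: list = field(default_factory=list)  # search-grounded sources in the real system

def flag_clause(clause: str) -> ClauseFlag:
    text = clause.lower()
    if any(t in text for t in RED_TERMS):
        return ClauseFlag(clause, "RED")
    if any(t in text for t in YELLOW_TERMS):
        return ClauseFlag(clause, "YELLOW")
    return ClauseFlag(clause, "GREEN")
```

Keeping the citations on the flag object is what lets the voice agent back every warning with a grounded source instead of free-form legal opinion.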

Tech stack

React · TypeScript · FastAPI · Gemini Live API · Google ADK · WebSocket · Docker · Cloud Run
Private repo · Request access →
03 · AI Mock Interview Platform

PrepVoice

Full-stack AI interview prep platform with real-time voice interaction. Analyzes job descriptions and resumes, conducts adaptive mock interviews, and tracks readiness progression.

5 · Scoring dims
3 · LLM backends
6+ · Domains

Key features

Real-time voice interviews with follow-up questions

Multi-dimensional scoring (technical, communication, depth, JD relevance, STAR)

Gap analysis between JD requirements and resume

Body language feedback via MediaPipe

Level-aware questions from junior to director

Session replay with full transcripts and scores

Tech stack

Next.js 14 · TypeScript · FastAPI · PostgreSQL · Ollama · Claude · Gemini · Web Speech API
Private repo · Request access →
04 · 1,500+ Tests on Small LLM JSON Generation

Structured Output JSON

Rigorous test harness showing that small LLMs fail at JSON generation in silent, dangerous ways. A well-instructed 2B model jumped from 30% to 90% compliance, outperforming 7B models run at default settings.

1,500+ · Tests
7 · Models
8 · KB rules

Results & architecture

Key features

Multi-backend: HuggingFace Transformers + Ollama

Detects parse failures, hallucinated fields, type mismatches, silent failures

Ablation study showing 1 rule held accuracy together

4 progressively complex JSON schemas tested

JSON Mode degraded 2 of 3 models tested

95% confidence intervals with proper statistics
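The failure modes the harness tracks can be bucketed with a small classifier over raw model output. A minimal sketch of that bucketing against a toy schema, using only the standard library (the `EXPECTED` schema and bucket names here are illustrative, not the harness's actual taxonomy):

```python
import json

# Hypothetical one-level schema: field name -> required Python type.
EXPECTED = {"name": str, "age": int}

def classify_output(raw: str) -> str:
    """Bucket a model's raw output into the failure modes the harness tracks."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_failure"
    if not isinstance(obj, dict):
        return "type_mismatch"
    if set(obj) - set(EXPECTED):
        return "hallucinated_fields"
    for key, typ in EXPECTED.items():
        if key not in obj:
            return "missing_fields"
        # Caveat: isinstance(True, int) is True in Python, so a boolean
        # slipping into an int field passes this check -- exactly the kind
        # of silent failure a real harness has to special-case.
        if not isinstance(obj[key], typ):
            return "type_mismatch"
    return "ok"
```

Counting buckets per model and prompt variant is what turns raw generations into the compliance percentages reported above.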

Tech stack

Python · PyTorch · HuggingFace · Ollama · Matplotlib · JSON Schema
Private repo · Request access →
05 · Context Position Bias in Small LLMs

Lost in the Middle

Empirical study testing whether the "Lost in the Middle" phenomenon from GPT-scale papers applies to 2-4B models. Each architecture shows distinct position-handling behavior — the classic U-curve does not appear.

~500/model · Trials
7 · Positions
3 · Models

Key features

Gemma-2B: strong recency bias (p=0.023)

Llama-3B: completely flat — no position effect (p=1.0)

Gemma-4B: weak middle dip, not statistically significant

7 hard semantic distractors per QA pair

70-100 document contexts (~7-10K tokens)

Replication of Liu et al., 2023 on smaller models
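The experimental core of a study like this is simple: fix the gold document at one of the 7 positions among shuffled distractors, run the trial, and tally accuracy per position. A minimal sketch of those two steps, with the model call abstracted away (function names here are illustrative, not the study's codebase):

```python
import random
from collections import defaultdict

def build_context(gold: str, distractors: list[str], position: int) -> list[str]:
    """Place the gold document at a fixed index among shuffled distractors."""
    docs = distractors[:]       # hard semantic distractors in the real study
    random.shuffle(docs)        # randomize everything except the gold slot
    docs.insert(position, gold)
    return docs

def accuracy_by_position(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Tally per-position accuracy from (gold position, answered correctly) trials."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pos, correct in results:
        totals[pos] += 1
        hits[pos] += correct
    return {pos: hits[pos] / totals[pos] for pos in totals}
```

A U-curve would show up here as high accuracy at the smallest and largest positions with a dip in the middle; the flat Llama-3B result means this dict comes out roughly constant across positions.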

Tech stack

Python · PyTorch · HuggingFace · SciPy · Matplotlib
Private repo · Request access →