Repo: github.com/midimurphdesigns/fedbench
Live demo: fedbench.kevinmurphywebdev.com
Read the full story: Building fedbench
An open-source evaluation harness for grounded LLM Q&A. The agent reads federal documents and answers questions about them; the harness scores hallucination, citation accuracy, and refusal discipline as first-class metrics. 21 verified Q&A pairs across two side-by-side public corpora (Medicare and OSHA), ~$0.025 per pair, and a replay path that runs in ~1 second with no API key.
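For a sense of what "first-class metrics" means in practice, here is a minimal sketch of a per-question result record the harness might emit; the field names are illustrative assumptions, not fedbench's actual schema.

```ts
// Hypothetical per-question result record; field names are illustrative,
// not fedbench's actual schema.
interface QuestionResult {
  id: string;                                   // Q&A pair identifier
  corpus: "medicare" | "osha";                  // which public corpus the question targets
  answer: string;                               // the agent's answer text
  citedPage: number | null;                     // page the agent claims supports the answer
  citationSupported: boolean;                   // did the deterministic citation check pass?
  judgeVerdict: "supported" | "hallucinated" | "partial"; // LLM-as-judge grade
  refusal: { expected: boolean; actual: boolean };         // refusal discipline
  costUsd: number;                              // per-pair spend (~$0.025 on average)
  latencyMs: number;
}
```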
How it's built
Bun + strict TypeScript, the Anthropic SDK, and BM25 retrieval over chunked PDFs. Three layered checks run per question: a deterministic citation check (does the agent's claimed page actually contain its answer's load-bearing tokens?), an LLM-as-judge pass on Opus 4.7 (deliberately a stronger model than the agent's Sonnet 4.6), and a refusal-correctness check on an out-of-corpus split. Every model call goes through a Sonnet → Haiku fallback ladder with full provenance attached to the response: which rung answered, latency, cost, and the attempts that bailed.
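A minimal sketch of how a deterministic citation check like this could work, assuming the harness has the cited page's raw text: pull the answer's load-bearing tokens (numbers, dollar amounts, longer content words) and require most of them to appear on the claimed page. The helper names, regex, and threshold are assumptions, not the repo's implementation.

```ts
// Sketch of a token-containment citation check; helper names, the regex,
// and the 0.8 threshold are illustrative assumptions.
const STOPWORDS = new Set(["the", "a", "an", "of", "to", "and", "in", "for", "is", "on", "that"]);

function loadBearingTokens(answer: string): string[] {
  // Keep dollar amounts, numbers, percentages, and content words of 4+ letters.
  return (answer.toLowerCase().match(/\$?\d[\d,.]*%?|[a-z]{4,}/g) ?? [])
    .filter((t) => !STOPWORDS.has(t));
}

function citationSupported(answer: string, citedPageText: string, threshold = 0.8): boolean {
  const tokens = loadBearingTokens(answer);
  if (tokens.length === 0) return false;      // nothing checkable: fail closed
  const page = citedPageText.toLowerCase();
  const hits = tokens.filter((t) => page.includes(t)).length;
  return hits / tokens.length >= threshold;   // most load-bearing tokens must appear on the page
}
```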
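And a sketch of what a Sonnet → Haiku fallback ladder with provenance could look like on top of the Anthropic SDK; the model IDs, pricing table, and provenance fields here are assumptions, not necessarily what fedbench records.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Model IDs, pricing, and provenance fields below are illustrative assumptions.
const client = new Anthropic();
const LADDER = ["claude-sonnet-4-5", "claude-haiku-4-5"];

// Hypothetical USD-per-million-token prices, for illustration only.
const PRICE_PER_MTOK: Record<string, { in: number; out: number }> = {
  "claude-sonnet-4-5": { in: 3, out: 15 },
  "claude-haiku-4-5": { in: 1, out: 5 },
};

interface Provenance {
  model: string;                                      // which rung answered
  latencyMs: number;
  costUsd: number;
  failedAttempts: { model: string; error: string }[]; // rungs that bailed
}

async function callWithFallback(prompt: string): Promise<{ text: string; provenance: Provenance }> {
  const failedAttempts: Provenance["failedAttempts"] = [];
  for (const model of LADDER) {
    const start = Date.now();
    try {
      const msg = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });
      const text = msg.content.map((b) => (b.type === "text" ? b.text : "")).join("");
      const price = PRICE_PER_MTOK[model] ?? { in: 0, out: 0 };
      const costUsd =
        (msg.usage.input_tokens * price.in + msg.usage.output_tokens * price.out) / 1_000_000;
      return { text, provenance: { model, latencyMs: Date.now() - start, costUsd, failedAttempts } };
    } catch (err) {
      failedAttempts.push({ model, error: String(err) }); // record the bailed attempt, try next rung
    }
  }
  throw new Error(`All rungs failed: ${JSON.stringify(failedAttempts)}`);
}
```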
Artifacts worth reading
- The agent prompt + system rules that enforce citation format and refusal phrasing
- The judge harness that grades whether a cited chunk supports the agent's claim
- The recordings file that powers the no-API-key replay path
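A rough sketch of how a recordings-backed replay path could work: load the recorded responses once and serve every call from the map instead of the API. The file name, key scheme, and record shape are assumptions, not fedbench's actual format.

```ts
import { readFileSync } from "node:fs";

// Record shape, file name, and key scheme are illustrative assumptions.
interface Recording {
  promptHash: string;   // key derived from the prompt text
  model: string;
  response: string;
  latencyMs: number;
  costUsd: number;
}

// Load recordings once; replay needs no API key because nothing hits the network.
function loadRecordings(path = "recordings.json"): Map<string, Recording> {
  const entries: Recording[] = JSON.parse(readFileSync(path, "utf8"));
  return new Map(entries.map((r) => [r.promptHash, r]));
}

function replayCall(recordings: Map<string, Recording>, promptHash: string): Recording {
  const hit = recordings.get(promptHash);
  if (!hit) throw new Error(`No recording for prompt ${promptHash}; run the live path first.`);
  return hit; // same provenance fields as a live call, so reports stay comparable
}
```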
The trade-offs
Custom evals over verified corpora are slower to author than vendor evals over synthetic data — but they catch the failures that actually break trust. The harness is opinionated on purpose: deterministic checks before LLM checks, recordings as audit artifacts, provenance on every call.