Repo: github.com/midimurphdesigns/fedbench
Live demo: fedbench.kevinmurphywebdev.com
Read the full story: Building fedbench
An open-source evaluation harness for grounded LLM Q&A. The agent reads federal documents and answers questions about them; the harness scores hallucination, citation accuracy, and refusal discipline as first-class metrics. 21 verified Q&A pairs across two side-by-side public corpora (Medicare and OSHA), ~$0.025 per pair, and a replay path that runs in ~1 second with no API key.
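For a sense of what "first-class metrics" means in practice, here is a minimal sketch of a per-question result record the harness might emit; the field names are illustrative assumptions, not fedbench's actual schema.

```ts
// Hypothetical per-question result record; field names are illustrative,
// not fedbench's actual schema.
interface QuestionResult {
  id: string;                                   // Q&A pair identifier
  corpus: "medicare" | "osha";                  // which public corpus the question targets
  answer: string;                               // the agent's answer text
  citedPage: number | null;                     // page the agent claims supports the answer
  citationSupported: boolean;                   // did the deterministic citation check pass?
  judgeVerdict: "supported" | "hallucinated" | "partial"; // LLM-as-judge grade
  refusal: { expected: boolean; actual: boolean };         // refusal discipline
  costUsd: number;                              // per-pair spend (~$0.025 on average)
  latencyMs: number;
}
```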
How it's built
Bun + strict TypeScript, the Anthropic SDK, and BM25 retrieval over chunked PDFs. Three layered checks run per question: a deterministic citation check (does the agent's claimed page actually contain its answer's load-bearing tokens?), an LLM-as-judge pass on Opus 4.7 (deliberately a stronger model than the agent's Sonnet 4.6), and a refusal-correctness check on an out-of-corpus split. Every model call goes through a Sonnet → Haiku fallback ladder with full provenance attached to the response: which rung answered, latency, cost, and the attempts that bailed.
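A minimal sketch of how a deterministic citation check like this could work, assuming the harness has the cited page's raw text: pull the answer's load-bearing tokens (numbers, dollar amounts, longer content words) and require most of them to appear on the claimed page. The helper names, regex, and threshold are assumptions, not the repo's implementation.

```ts
// Sketch of a token-containment citation check; helper names, the regex,
// and the 0.8 threshold are illustrative assumptions.
const STOPWORDS = new Set(["the", "a", "an", "of", "to", "and", "in", "for", "is", "on", "that"]);

function loadBearingTokens(answer: string): string[] {
  // Keep dollar amounts, numbers, percentages, and content words of 4+ letters.
  return (answer.toLowerCase().match(/\$?\d[\d,.]*%?|[a-z]{4,}/g) ?? [])
    .filter((t) => !STOPWORDS.has(t));
}

function citationSupported(answer: string, citedPageText: string, threshold = 0.8): boolean {
  const tokens = loadBearingTokens(answer);
  if (tokens.length === 0) return false;      // nothing checkable: fail closed
  const page = citedPageText.toLowerCase();
  const hits = tokens.filter((t) => page.includes(t)).length;
  return hits / tokens.length >= threshold;   // most load-bearing tokens must appear on the page
}
```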
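And a sketch of what a Sonnet → Haiku fallback ladder with provenance could look like on top of the Anthropic SDK; the model IDs, pricing table, and provenance fields here are assumptions, not necessarily what fedbench records.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Model IDs, pricing, and provenance fields below are illustrative assumptions.
const client = new Anthropic();
const LADDER = ["claude-sonnet-4-5", "claude-haiku-4-5"];

// Hypothetical USD-per-million-token prices, for illustration only.
const PRICE_PER_MTOK: Record<string, { in: number; out: number }> = {
  "claude-sonnet-4-5": { in: 3, out: 15 },
  "claude-haiku-4-5": { in: 1, out: 5 },
};

interface Provenance {
  model: string;                                      // which rung answered
  latencyMs: number;
  costUsd: number;
  failedAttempts: { model: string; error: string }[]; // rungs that bailed
}

async function callWithFallback(prompt: string): Promise<{ text: string; provenance: Provenance }> {
  const failedAttempts: Provenance["failedAttempts"] = [];
  for (const model of LADDER) {
    const start = Date.now();
    try {
      const msg = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });
      const text = msg.content.map((b) => (b.type === "text" ? b.text : "")).join("");
      const price = PRICE_PER_MTOK[model] ?? { in: 0, out: 0 };
      const costUsd =
        (msg.usage.input_tokens * price.in + msg.usage.output_tokens * price.out) / 1_000_000;
      return { text, provenance: { model, latencyMs: Date.now() - start, costUsd, failedAttempts } };
    } catch (err) {
      failedAttempts.push({ model, error: String(err) }); // record the bailed attempt, try next rung
    }
  }
  throw new Error(`All rungs failed: ${JSON.stringify(failedAttempts)}`);
}
```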
Artifacts worth reading
- The agent prompt + system rules that enforce citation format and refusal phrasing
- The judge harness that grades whether a cited chunk supports the agent's claim
- The recordings file that powers the no-API-key replay path
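A rough sketch of how a recordings-backed replay path could work: load the recorded responses once and serve every call from the map instead of the API. The file name, key scheme, and record shape are assumptions, not fedbench's actual format.

```ts
import { readFileSync } from "node:fs";

// Record shape, file name, and key scheme are illustrative assumptions.
interface Recording {
  promptHash: string;   // key derived from the prompt text
  model: string;
  response: string;
  latencyMs: number;
  costUsd: number;
}

// Load recordings once; replay needs no API key because nothing hits the network.
function loadRecordings(path = "recordings.json"): Map<string, Recording> {
  const entries: Recording[] = JSON.parse(readFileSync(path, "utf8"));
  return new Map(entries.map((r) => [r.promptHash, r]));
}

function replayCall(recordings: Map<string, Recording>, promptHash: string): Recording {
  const hit = recordings.get(promptHash);
  if (!hit) throw new Error(`No recording for prompt ${promptHash}; run the live path first.`);
  return hit; // same provenance fields as a live call, so reports stay comparable
}
```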
The trade-offs
Custom evals over verified corpora are slower to author than vendor evals over synthetic data — but they catch the failures that actually break trust. The harness is opinionated on purpose: deterministic checks before LLM checks, recordings as audit artifacts, provenance on every call.