Repo: github.com/midimurphdesigns/fedbench
Live demo: fedbench.kevinmurphywebdev.com
Part of a trilogy of projects demonstrating the FDE / Applied AI shape — see also fieldops-mcp (agent tool design) and grant-pilot (sub-agent orchestration).
I spent a weekend building fedbench — an open-source evaluation harness for LLM agents that read documents and answer questions about them. It's MIT-licensed, runs end-to-end on a laptop, and ships with two side-by-side public corpora: three Medicare publications and three OSHA workplace-safety publications. Same agent, same retrieval, same judge — different language shape, different signal.
This post is the version I'd want to read if someone else had built it. What it does, how it's put together, why the decisions went the way they did, and what it actually says about the kind of work I want to be doing.
The problem fedbench measures
If you've used an AI assistant to answer questions about a long document — a policy PDF, a contract, a manual — you've probably seen one of three failure modes:
- Made-up citations. The agent confidently cites "page 47" for a fact that's actually on page 12, or on no page at all.
- Confidently wrong answers. The agent paraphrases the document but quietly distorts a number, a deadline, or an eligibility rule.
- Guessing instead of refusing. When the answer isn't in the document, the agent invents one rather than saying "I don't see that here."
These are easy to miss in a demo, where the person asking already knows the answer. They get expensive at scale, in the hands of users who can't easily check the source — caseworkers, paralegals, claims adjusters, anyone whose job involves reading dense documents and answering questions about them.
fedbench makes those three failures measurable. That's the whole pitch.
What it actually does
The harness has three layers, each one named after the thing it measures:
- Citation accuracy. Does every answer cite a real page that contains the claim? This is checked deterministically — the documents are parsed into chunks tagged with page numbers, and the agent's claimed citation has to match an actual chunk.
- Citation faithfulness. Even if the page exists, does it actually support the answer? This one isn't deterministic. fedbench uses a stronger model (Claude Opus 4.7) to read the cited page and judge whether the agent's answer is supported, partially supported, or not supported at all. The judge is always a more capable model than the agent — never the same model grading itself.
- Refusal discipline. When asked something the documents don't contain, does the agent refuse, or does it guess? Tested with a held-out set of questions whose answers aren't in the documents at all.
A run produces a structured report: per-question verdict, cost in dollars and tokens, latency at p50/p95, and which model on the fallback ladder produced each answer. End-to-end cost runs about 2 to 3 cents per question.
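To make the deterministic layer concrete, here's roughly what the citation-existence check boils down to. A minimal sketch with illustrative names, not the actual code in the repo:

```typescript
// A minimal sketch of the deterministic citation-existence check. Types and names
// are illustrative, not fedbench's actual code. The claimed page has to map to a
// real parsed chunk, and that chunk has to contain the answer's load-bearing tokens.
type PageChunk = { doc: string; page: number; text: string };

function citationExists(
  claimed: { doc: string; page: number },
  answerTokens: string[], // e.g. ["$1,736", "8 months"], extracted from the agent's answer
  index: PageChunk[],
): "pass" | "fail" | "skip" {
  const chunk = index.find((c) => c.doc === claimed.doc && c.page === claimed.page);
  if (!chunk) return "fail"; // cited a page that doesn't exist in the parsed index
  if (answerTokens.length === 0) return "skip"; // nothing extractable; defer to the LLM judge
  return answerTokens.every((t) => chunk.text.includes(t)) ? "pass" : "fail";
}
```

Anything the deterministic layer can't settle falls through to the judge, which matters for the domain comparison below.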
Two domains, same harness — what changed when I added OSHA
The version I shipped first only had the Medicare corpus. Adding the OSHA corpus a few days later was the test of whether the harness was actually general or just demo-shaped. Same agent, same retrieval, same judge, same scoring rules — only the source documents and the ground-truth questions changed.
The numbers on both runs:
| | Medicare (11 pairs) | OSHA (10 pairs) |
| --- | --- | --- |
| Pairs that pass the gate | 11 / 11 | 10 / 10 |
| Citation existence — pass / skip | 8 / 0 | 5 / 3 |
| Judge — faithful | 6 / 8 | 8 / 8 |
| Judge — partially-faithful | 2 / 8 | 0 / 8 |
| Out-of-corpus refusal | 3 / 3 | 2 / 2 |
| Cost per pair | ~$0.029 | ~$0.024 |
| Latency p50 / p95 | 2.2s / 4.2s | 2.0s / 2.5s |
The interesting line is the citation-existence row. On Medicare, the deterministic checker matched all 8 in-corpus answers against their cited pages. On OSHA, only 5 matched — the other 3 had to defer to the LLM judge.
That's not a bug; it's the harness telling you something true about the domains. Medicare answers are number-dense ("$1,736 deductible", "8-month enrollment window"), so the regex-driven token extractor has plenty to grab. OSHA answers are procedural ("one warden per 20 employees", "treatment within 3 to 4 minutes", "10 or fewer employees may communicate orally") — the shape of the answer is different, the load-bearing tokens are different, and the deterministic layer naturally has less to work with. The judge picks up the slack.
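For a sense of why the two domains diverge at this layer, here's the kind of pattern a load-bearing-token extractor leans on. An illustrative regex, not the repo's actual one:

```typescript
// Illustrative only, not fedbench's actual patterns. Regexes like this grab dollar
// amounts and durations, so number-dense Medicare answers give the deterministic
// layer far more to check than OSHA's procedural prose does.
const loadBearingTokens = (answer: string): string[] =>
  answer.match(/\$[\d,]+(?:\.\d+)?|\b\d+(?:\.\d+)?[\s-]*(?:days?|months?|years?|minutes?|employees?)\b/gi) ?? [];

loadBearingTokens("The deductible is $1,736 and the enrollment window is 8 months.");
// => ["$1,736", "8 months"]
```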
The lesson I took from the side-by-side: a one-domain harness lets you show that an eval pipeline can run end-to-end. A two-domain harness lets you show that the same pipeline responds usefully to different language shapes — and that the per-layer contribution shifts when the domain shifts. That second observation, I think, is the actually-useful one.
How it's built
The stack is deliberately small. Bun for the runtime (fast, native TypeScript, no transpile step). TypeScript in strict mode. The Anthropic SDK for both the agent and the judge. Zero database — the document index is a JSON file, the eval set is JSONL, the runs write to disk. No vector database, no embedding API, no SaaS account required to fork it and run.
Two pieces are worth calling out, because they're where most of the design judgment lives:
Retrieval is BM25, not embeddings. This was the most counterintuitive call. The default move in 2026 is to reach for embeddings and a vector database for anything RAG-shaped. I went the other way. Caseworker-style queries are short, factual, and dominated by literal terms — "Part B premium," "10 days," "8-month period." The relevance signal in those queries is the literal terms themselves, which is exactly what BM25 ranks on. Embeddings shine on paraphrase-heavy or cross-lingual workloads; this isn't either. BM25 also has zero infrastructure cost, which keeps the "fork it and run it" property intact. If a future eval shows BM25 hurting accuracy, the upgrade path is hybrid retrieval — but that's a measurement to earn, not an assumption to start with.
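For a concrete sense of what that choice buys, here's the core of a BM25 scorer. A minimal sketch over pre-tokenized chunks, not the repo's actual retriever:

```typescript
// A minimal BM25 scorer over pre-tokenized chunks. This is a sketch of the idea,
// not fedbench's actual retriever; type and parameter names are illustrative.
type IndexedChunk = { id: string; page: number; tokens: string[] };

function bm25Score(
  queryTerms: string[],
  chunk: IndexedChunk,
  corpus: IndexedChunk[],
  k1 = 1.2,
  b = 0.75,
): number {
  const N = corpus.length;
  const avgLen = corpus.reduce((sum, c) => sum + c.tokens.length, 0) / N;
  return queryTerms.reduce((score, term) => {
    const tf = chunk.tokens.filter((t) => t === term).length;
    if (tf === 0) return score; // a term absent from this chunk contributes nothing
    const df = corpus.filter((c) => c.tokens.includes(term)).length;
    const idf = Math.log(1 + (N - df + 0.5) / (df + 0.5));
    const lengthNorm = 1 - b + b * (chunk.tokens.length / avgLen);
    return score + idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm));
  }, 0);
}

// Ranking: score every chunk against the query's tokens and sort descending.
```

The whole scorer is a couple dozen lines with no services to stand up, which is exactly what keeps the fork-it-and-run-it property intact.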
There's a fallback ladder, not a single model. Production AI deployments need a degradation path. The primary model can rate-limit, error, or be slow on a given call. fedbench encodes the ladder explicitly: Sonnet 4.6 first (best instruction-following at moderate cost), Haiku 4.5 if the primary fails (~5x cheaper, ~2x faster, lower quality), and a third rung for open-weights as a last resort. The cascade is conservative — only true provider failures (rate limits, 5xx errors, network) trigger a fallback. A 400-class error means there's a bug in the harness, not a problem with the provider, and the ladder stops so the bug isn't masked. Every answer reports which rung produced it, with the full provenance of any earlier rungs that failed.
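Here's a sketch of the cascade rule, with illustrative names and shapes rather than the repo's actual module:

```typescript
// Sketch of a conservative fallback ladder. Names and shapes are illustrative,
// not the repo's actual API. Only true provider failures cascade to the next rung;
// a 400-class error means a bug in the harness, so the ladder stops instead of masking it.
type Rung = { model: string; call: (prompt: string) => Promise<string> };
type RungResult = {
  text: string;
  rung: string; // provenance: which rung produced the answer
  priorFailures: { model: string; error: string }[]; // and why earlier rungs didn't
};

async function callWithFallback(rungs: Rung[], prompt: string): Promise<RungResult> {
  const priorFailures: { model: string; error: string }[] = [];
  for (const rung of rungs) {
    try {
      const text = await rung.call(prompt);
      return { text, rung: rung.model, priorFailures };
    } catch (err) {
      const status = (err as { status?: number } | null)?.status ?? 0;
      const providerFailure = status === 429 || status >= 500 || status === 0; // rate limit, 5xx, or no status (network)
      if (!providerFailure) throw err; // 400-class: surface the harness bug immediately
      priorFailures.push({ model: rung.model, error: String(err) });
    }
  }
  throw new Error(`All rungs failed: ${JSON.stringify(priorFailures)}`);
}
```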
The fallback ladder is the piece that surprises people the most. Most demo agents don't have one. Most production agents need one.
What it demonstrates, and how this relates to where I'm pointing
I built fedbench because I wanted to be honest with myself about what kind of engineer I am at this point. I've spent the last several years shipping React and TypeScript at federal scale — IRS.gov, FedNow, the Michigan unemployment system. That's real production experience, and I'm not running away from it. But the work I want to be doing next sits a layer up: deploying AI systems into customer environments, owning the whole loop from "what does this team actually need" to "is the agent good enough to ship," and proving the answer to the second question with measurements rather than vibes.
That role has names — Forward Deployed Engineer, Applied AI Engineer, Solutions Engineer at an AI-native company. The shared shape is the same: production AI judgment, customer-facing delivery, and the engineering rigor to make the system auditable rather than just demoable.
fedbench is the cleanest way I can show that work. The skills it makes visible:
- Designing an evaluation pipeline that distinguishes deterministic checks (cheap, exact) from judgment calls (expensive, approximate) — and using the right tool for each.
- Picking retrieval strategies based on the actual query distribution rather than the hyped default.
- Building seams in the right places so a second domain is a config change, not a fork — and so the harness can measure domain-by-domain differences rather than assume them.
- Separating the API-cost surface from the rest of the pipeline cleanly enough that the same scoring code can run on live model output or on a checked-in recording, with no branching at the call site. That's what makes the no-key demo path possible without forking the runner.
- Building a fallback ladder with explicit cascade rules, because production AI systems can't have a single point of failure pretending to be a measurement.
- Treating cost and latency as first-class metrics, not afterthoughts. Every answer carries its dollar cost and its rung-of-origin.
- Writing it all in TypeScript with strict types, real tests, and a CI pipeline — the same engineering discipline I'd apply to any production system.
What it doesn't do yet
I'd rather be honest about the gaps than oversell what's there.
- The third fallback rung (open-weights via OpenRouter) is documented but not yet wired. Wiring it means adding an OpenAI-compatible client, a dependency the harness deliberately avoids until it's actually needed.
- A hybrid retriever (BM25 plus a lightweight reranker) ships as a structural seam — there's a Retriever interface and a hybridRetriever stub in the code, but the stub currently delegates to BM25 (a sketch of the seam's shape follows this list). The point of the seam is to let a future eval measure whether hybrid retrieval helps on a given domain, rather than assume it does. The OSHA citation-skip rate is the kind of signal that would justify building it out.
- The judge currently flags "partially-faithful" when the agent's facts are right but it dropped a hedging qualifier the source had ("most people pay", "may pay"). A more nuanced judge prompt would distinguish "wrong fact" from "omitted hedge" and weight them differently. Not built yet.
- The eval sets are small on purpose — 11 Medicare pairs and 10 OSHA pairs. The contributing guide explains why: every pair has a provenance tag, and an LLM-generated ground truth would contaminate the eval loop. The sets grow as real domain experts contribute pairs, not as I generate more myself.
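The seam mentioned above looks roughly like this. Only the Retriever and hybridRetriever names come from the repo; the rest is an illustrative sketch:

```typescript
// Illustrative shape of the retrieval seam; only the Retriever and hybridRetriever
// names come from the repo, everything else here is an assumption. The stub exists
// so a future eval can swap implementations and measure the difference.
type RetrievedChunk = { doc: string; page: number; text: string; score: number };

interface Retriever {
  retrieve(query: string, k: number): RetrievedChunk[];
}

declare const bm25Retriever: Retriever; // the real default path

const hybridRetriever: Retriever = {
  // Stub today: delegate to BM25 until a measurement shows a reranker earns its keep.
  retrieve: (query, k) => bm25Retriever.retrieve(query, k),
};
```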
These aren't apologies — they're the next things I'd build. Listing them is part of the point.
Try it yourself, in 30 seconds, with no API key
I wanted anyone to be able to see what fedbench actually does without paying for an Anthropic API key first. The trick: the harness's entire API-cost surface is exactly two functions — the agent's call to Claude, and the judge's call to Claude. Everything else (citation matching, refusal scoring, aggregation, the comparison report) is pure code with no network in the loop.
So I added a recording layer. A live run can dump every agent answer and every judge verdict to a JSONL file. A replay run reads that file and routes the same outputs through the same scoring code. The recordings ship with the repo, versioned alongside the questions they correspond to, so a stale recording fails loudly rather than producing wrong numbers.
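Here's roughly the shape of that seam, simplified and with illustrative names:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

// Sketch of the record/replay seam. Names are illustrative, not fedbench's actual API.
// Downstream scoring code only ever sees a function of this shape, so it can't tell
// a live run from a replay.
type RecordedAnswer = { questionId: string; answer: string };
type GetAnswer = (questionId: string, prompt: string) => Promise<string>;

// Live mode: call the model, then append the answer to a JSONL transcript.
function recording(callModel: GetAnswer, transcriptPath: string): GetAnswer {
  return async (questionId, prompt) => {
    const answer = await callModel(questionId, prompt);
    appendFileSync(transcriptPath, JSON.stringify({ questionId, answer } satisfies RecordedAnswer) + "\n");
    return answer;
  };
}

// Replay mode: look each answer up in the transcript; fail loudly if it's missing.
function replay(transcriptPath: string): GetAnswer {
  const byId = new Map<string, string>();
  for (const line of readFileSync(transcriptPath, "utf8").split("\n").filter((l) => l.length > 0)) {
    const { questionId, answer } = JSON.parse(line) as RecordedAnswer;
    byId.set(questionId, answer);
  }
  return async (questionId) => {
    const answer = byId.get(questionId);
    if (answer === undefined) {
      throw new Error(`No recorded answer for ${questionId}. Stale recording?`);
    }
    return answer;
  };
}
```

With that seam in place, the demo path is just: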
git clone https://github.com/midimurphdesigns/fedbench.git
cd fedbench
bun install
# Replay the eval against the recorded agent + judge outputs.
# No API key, no PDF download, no parse step — pure scoring pipeline.
bun run eval:replay --corpus medicare
bun run eval:replay --corpus osha
Each replay finishes in well under a second and prints the same per-pair breakdown, comparison numbers, and pass/fail verdict you'd get from a live run. That's the demo path. If you want to run the harness end-to-end against a real model, the live setup is below.
Running it for real
You'll need Bun, Python 3 with pypdf (pip install pypdf), and an Anthropic API key.
# Set your API key
cp .env.example .env
# edit .env and set ANTHROPIC_API_KEY
# Fetch the documents (3 public PDFs, checksum-verified)
bun run corpus:fetch
# Parse the PDFs into per-page text (uses pypdf)
bun run corpus:parse
# Build the chunk index for retrieval
bun run corpus:chunk
# Sanity-check the API connection
bun run smoke
# Run the full eval — 11 questions, ~30 seconds, ~$0.30
bun run eval
To run on OSHA instead of Medicare, append --corpus osha to each step. Both corpora ship with the repo. Adding your own is documented in the README — manifest, questions file, and a small DomainConfig entry are all you need.
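For a rough sense of what that entry might look like (field names here are illustrative, not the actual schema; the README is authoritative):

```typescript
// Hypothetical DomainConfig shape. Field names are assumptions for illustration,
// not fedbench's actual schema; the README documents the real one.
interface DomainConfig {
  id: string;               // e.g. "medicare", "osha", or your own corpus
  manifestPath: string;     // source PDFs with URLs and checksums
  questionsPath: string;    // JSONL of ground-truth pairs, each with a provenance tag
  outOfCorpusPath?: string; // held-out questions the documents can't answer, for refusal scoring
}

const buildingCodes: DomainConfig = {
  id: "building-codes",
  manifestPath: "corpora/building-codes/manifest.json",
  questionsPath: "corpora/building-codes/questions.jsonl",
};
```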
The eval prints a per-question breakdown (citation verdict, judge verdict, cost, latency, which rung answered) and a summary at the end. The full repo, including the architecture and design-notes docs, is at github.com/midimurphdesigns/fedbench.
How these skills transfer
The harness is a federal-policy demo. The skills underneath it are domain-portable. A few real bottlenecks the same shape addresses:
- Enterprise RAG that hallucinates in production. Internal knowledge assistants over policy / contracts / runbooks fail the same way the agent here would without the harness — confidently wrong answers cost real money. A grounded eval set with deterministic citation checks turns "is this answer right?" into a number you can ship against.
- Customer-support AI citing the wrong policy. Same shape, different corpus. The deterministic citation check catches "right answer, wrong page" before the LLM-as-judge runs, so disagreements between the two layers are themselves a signal that the support agent's grounding is drifting.
- Regulated industries (healthcare, financial services, legal) that need an AI accountability trail. Recordings + per-question provenance (which rung answered, what it cost, what it cited) give a defensible audit trail that satisfies a compliance reviewer in a way "the model said so" never will.
- Model drift across SDK and model upgrades. The eval set re-runs in minutes; if Sonnet's next minor version regresses on grounding rigor, the regression shows up in a number, not a customer ticket. This is what continuous evaluation looks like for AI features.
- Picking models without measurement. The fallback ladder + cost/latency provenance turn a vibes-based choice ("Sonnet feels better") into a calibrated one ("Haiku is good enough on 90% of intents at 1/5 the cost; here's the audit"). That alone pays for the harness.
These are the conversations I'd rather have than another abstract "agents are the future" exchange.
Why I built it, and what I'm hoping it starts
fedbench is a small, focused example. The reason it exists is bigger than the example.
A lot of the work I want to be doing more of in the next few years is the work the FDE / Applied AI shape describes: sitting close to a real team's problem, picking the right model and the right tool for the job, building the eval and the guardrails alongside the feature, and shipping something that holds up under measurement instead of just under a demo. The skills underneath that role transfer across plenty of domains — not just policy documents. Internal knowledge assistants. Agentic workflows that touch real systems. Cost-and-latency-sensitive AI features inside existing products. Anything where "is this actually good enough" needs to be a number, not an opinion.
fedbench is the first project in a trilogy that explores that shape from different angles. fieldops-mcp is about shaping the tools an agent can use. grant-pilot is about composing tools and sub-agents into a multi-turn workflow that runs end-to-end. fedbench is the rigor underneath both — measurement as a first-class concern instead of an afterthought.
If you're working in any of those areas — at your own company, in your own team, inside Deloitte's AI practice, or at any of the AI-native companies building this kind of system — I'd genuinely enjoy a conversation. The version of that conversation I find most useful is usually the smallest one: one specific problem, one specific constraint, what you've tried, what's surprised you. I'm always open to swapping notes on what's actually working in production right now, and to learning about teams doing interesting work in the space.
Easiest way to reach me is the contact page on this site, or just connect on LinkedIn. The repo is at github.com/midimurphdesigns/fedbench, the live demo is at fedbench.kevinmurphywebdev.com, and the docs in there go deeper than this post does, with an architecture overview, an eval methodology writeup, and a design notes file that covers the calls I haven't gotten to here.