
Building grant-pilot: a multi-turn agent that orchestrates three sub-agents over real federal data

Third in the trilogy. fedbench measured agents. fieldops-mcp shaped what they could do. grant-pilot is what happens when an agent actually has to run a workflow end-to-end — discovery, eligibility, drafter, all dispatched by a planner over public grants.gov + SAM.gov data. Here's what I built and what shipping it taught me.

Repo: github.com/midimurphdesigns/grant-pilot

Live demo: grant-pilot.kevinmurphywebdev.com

Part of a three-project trilogy demonstrating the FDE / Applied AI shape — see also fedbench (eval rigor) and fieldops-mcp (agent tool design).

grant-pilot is the third project in a trilogy with fedbench and fieldops-mcp. fedbench is about measuring whether an agent is right. fieldops-mcp is about shaping what the agent can do at all. grant-pilot is what those look like composed: an agent that has to actually run a workflow, multi-turn, with real APIs, with structured failures, with a hosted demo strangers can run without trusting me with credentials.

What it does

A small business or nonprofit picks one of five intents (e.g. "I run a 12-person construction firm in Arizona, what infrastructure-related grants might fit?"). The agent does three things a grants consultant does:

  1. Discovery. Derives a keyword query from the intent, searches grants.gov, and ranks five candidates 0–100 with a one-line rationale each.
  2. Eligibility. For the top 3 candidates, fetches the full grant record, optionally checks SAM.gov registration status, and returns pass / fail / uncertain with reasons grounded in the eligibility text.
  3. Drafter. For the highest-ranked candidate that passed (or the highest "uncertain" if none passed cleanly), produces a structured application skeleton — section headings, per-section guidance, and prompts for the applicant to answer. Plus a watch-outs list of pitfalls grounded in the grant's eligibility text.

Three sub-agents, dispatched by a planner, composed into a transcript with provenance — which model answered each turn, how long it took, what it cost. Total cost per run: about five cents. Total wall time: about a minute.
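For context, here's roughly what a provenance-carrying transcript step might look like. This is a minimal sketch in TypeScript; the field names are illustrative, not the repo's actual types.

```typescript
// Illustrative sketch only — field names are assumptions, not the repo's actual types.
type SubAgent = "discovery" | "eligibility" | "drafter";

interface Provenance {
  model: string;     // which model actually answered (primary or fallback rung)
  latencyMs: number; // wall time for the turn
  costUsd: number;   // token cost attributed to the turn
}

interface TranscriptStep {
  agent: SubAgent;
  input: unknown;    // what the planner handed the sub-agent
  output: unknown;   // structured result, or a structured error (more on that below)
  provenance: Provenance;
}

// A full run is just the ordered list of steps the planner dispatched.
type Transcript = TranscriptStep[];
```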

Why three sub-agents, not five

The temptation is to add a "budget builder", a "compliance reviewer", a "prior-art search". Each of those is plausible. None of them is necessary to demonstrate the shape — sub-agent orchestration plus tool selection plus failure recovery. Adding them would be framework creep that dilutes the headline.

Three is the smallest count that proves the shape. Discovery uses one tool (search). Eligibility uses two (detail-fetch + SAM lookup) and exercises a hard-gate hoist where a deterministic SAM check overrides the model's verdict. Drafter uses one (detail-fetch) and exercises the most opinionated design choice in the project: it doesn't write prose.

The drafter doesn't write prose

This is the call I expect to get the most pushback on. "AI writes your grant application" is an easier marketing line. It would also be a worse product.

Federal grant applications need verifiable claims and applicant-specific voice. The agent will hallucinate both. So the drafter emits a structured skeleton — sections, guidance per section, questions the applicant must answer — instead of pretending it can speak for the organization. From a real run for the Arizona construction firm:

# Project Approach
  ↳ Construction activities, sequence, Davis-Bacon compliance plan...
  - What construction activities will you perform, and in what sequence?
  - What codes, standards, or federal requirements (Davis-Bacon Act,
    Buy American, environmental review) govern this project?
  - How will you manage subcontractors, and what portion of the work
    will your 12-person team self-perform versus sub out?

watch-outs:
  ! For-profit construction firms must apply through a state/local lead
    applicant — confirm partnership before further investment.
  ! SAM.gov registration must be active at submission AND throughout
    the period of performance.
  ! Davis-Bacon prevailing wage compliance is mandatory on federally-
    funded construction.

That's the shape AI is actually good at — compressing the reading work, surfacing what a NOFO asks for, structuring the response. Honest about what it isn't doing.
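To make the "skeleton, not prose" contract concrete, here's a rough sketch of the drafter's output shape. The names are assumptions for illustration, not the project's actual schema.

```typescript
// Sketch of the drafter's output shape — names are illustrative, not the actual schema.
interface SkeletonSection {
  heading: string;            // e.g. "Project Approach"
  guidance: string;           // what the NOFO expects this section to cover
  applicantPrompts: string[]; // questions only the applicant can answer
}

interface DraftSkeleton {
  grantId: string;
  sections: SkeletonSection[];
  watchOuts: string[];        // pitfalls grounded in the grant's eligibility text
}
```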

The hard gate the model can't override

If the user profile contains a UEI, the eligibility sub-agent checks SAM.gov registration status. If the registration is anything other than active, that fact gets prepended to the verdict's blockers regardless of what the model said.

Why bypass the model? Because federal grants categorically do not award to unregistered entities. Letting the model "decide" creates a path where it answers pass despite a fatal disqualifier. The hoist makes the constraint structural rather than emergent. It's the only place in the system where a deterministic check overrides the verdict — and that's exactly the kind of thing that should bypass the LLM, not depend on it.
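A minimal sketch of what that hoist can look like, assuming a samStatus string from the SAM.gov lookup and a model-produced verdict; the names are illustrative, not the repo's code.

```typescript
// Sketch of the hard-gate hoist; names are illustrative.
interface EligibilityVerdict {
  status: "pass" | "fail" | "uncertain";
  blockers: string[];
  reasons: string[];
}

function applySamGate(verdict: EligibilityVerdict, samStatus: string | null): EligibilityVerdict {
  // No UEI in the profile → no lookup was made; the model's verdict stands.
  if (samStatus === null) return verdict;
  if (samStatus === "Active") return verdict;

  // Deterministic override: an inactive registration is a categorical disqualifier,
  // prepended to the blockers regardless of what the model concluded.
  return {
    ...verdict,
    status: "fail",
    blockers: [`SAM.gov registration is ${samStatus}, not Active`, ...verdict.blockers],
  };
}
```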

How prompt design and external APIs are coupled

The first version of the discovery prompt produced queries like:

"commercial construction infrastructure small contractor Arizona Phoenix metro"

Zero useful hits.

grants.gov's keyword index does strict AND-matching — every term has to appear in the opportunity title or synopsis. Long queries get zero matches. After tightening the prompt to emit 2–4 broad nouns and explicitly drop geography (handled at eligibility time, not in the keyword index), the same intent produced a real shortlist:

52  PWEAA2023      — FY 2025 EDA Public Works and Economic Adjustment Assistance
45  DHS-25-MT-047  — FEMA Building Resilient Infrastructure and Communities (BRIC)
38  GR-RDC-25-001  — RESTORE Act Direct Component
35  HE125426R5001  — Military-Connected Schools Construction
30  VA-GRANTS-...  — State Veterans Home Construction Grant Program

The lesson, again: prompt design and external-API behavior are coupled. You can't tune one without understanding the other. The "AI part" is not separable from the "API part".
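As a sketch of what that coupling looks like in practice: the prompt carries the API's constraint, and a deterministic guard clamps the query even if the model drifts. The instruction wording and the clampQuery helper are illustrative, not the repo's actual code.

```typescript
// Sketch: the discovery instruction encodes the grants.gov AND-matching constraint,
// assuming the model returns a JSON array of terms.
const QUERY_INSTRUCTION = `
Derive a grants.gov keyword query from the applicant's intent.
Return 2-4 broad nouns as a JSON array of strings.
Do NOT include geography (state, city, region) — eligibility handles that later.
grants.gov AND-matches every term, so extra words mean zero hits.
`;

// Even with the prompt fixed, clamp the query so a verbose model can't regress it.
function clampQuery(terms: string[]): string {
  return terms.slice(0, 4).join(" ");
}
```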

Why the planner never throws

Sub-agent failures, search-API errors, JSON parse failures — every one of them surfaces as a structured TranscriptStep entry the renderer and the recorder both consume. The planner has zero try/catch blocks at the call-site level. Errors are values, not exceptions.

This is what production-shape error handling looks like in agent code. The hosted demo doesn't have to wrap calls in try/catch. The recording layer doesn't have to handle partial JSON. The eval scorer can pattern-match on result.kind === "error" without inspecting types. Failures compose; exceptions don't.
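A minimal sketch of the pattern, with illustrative names: one boundary wrapper turns exceptions into values, and everything downstream routes on kind.

```typescript
// Sketch of the errors-as-values pattern; names are illustrative.
type SubAgentResult<T> =
  | { kind: "ok"; value: T }
  | { kind: "error"; stage: string; message: string };

async function runSubAgent<T>(stage: string, fn: () => Promise<T>): Promise<SubAgentResult<T>> {
  try {
    return { kind: "ok", value: await fn() };
  } catch (err) {
    // In this sketch the only try/catch lives at this boundary; the planner,
    // renderer, recorder, and eval scorer all pattern-match on `kind` instead.
    return { kind: "error", stage, message: err instanceof Error ? err.message : String(err) };
  }
}
```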

The hosted demo, the custom-intent path, and the $3/day cap

I want strangers to be able to run this agent without cloning anything, registering for keys, or trusting me with their credentials. So there's a hosted demo: pick a preset intent or write your own funding-need description, watch the transcript stream in real time, see the same provenance the local CLI shows.

But the moment a public website hits an LLM provider, it's a cost surface. So the demo is hardened in four layers:

  • Five preset intents — each verified and recorded, so the recording can serve as a fallback when the budget runs out.
  • Custom intent + structured custom profile. Visitors describe their own scenario (20–600 chars) and fill in their own profile — NAICS code, state, ZIP, employee count, annual revenue, ownership designations, entity type, years in operation. Every structured field is bounded by an enum or a regex or a number range (see the schema sketch after this list), so the injection surface is identical to the preset case. The two free-text fields (intent and an optional mission description) are length-capped and filtered for jailbreak phrases before any model call.
  • Per-IP rate limit. Five runs per hour via Upstash. Enough headroom for a curious visitor; not enough for script-driven abuse.
  • Daily budget cap with transparent live readout. $3/day, ~45 demo runs. The page shows the running total and color-codes the bar (green → yellow → red) so visitors know what state they're in. Preset intents fall back to the recorded run when the cap is hit; custom intents return a 503 with a banner explaining the cap.
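Here's the kind of bounded schema those structured fields go through, sketched with zod for illustration; the field names and bounds are assumptions, not the repo's actual schema.

```typescript
import { z } from "zod";

// Illustrative only — field names and bounds are assumptions, not the repo's schema.
const CustomProfile = z.object({
  naicsCode: z.string().regex(/^\d{6}$/),                   // regex-bounded
  state: z.enum(["AZ", "CA", "TX" /* ...remaining states */]), // enum-bounded
  zip: z.string().regex(/^\d{5}$/),
  employeeCount: z.number().int().min(1).max(10_000),        // range-bounded
  annualRevenue: z.number().min(0),
  yearsInOperation: z.number().int().min(0).max(200),
  ownershipDesignations: z.array(z.enum(["veteran-owned", "woman-owned", "minority-owned"])),
  entityType: z.enum(["for-profit", "nonprofit", "tribal", "government"]),
  // The only free-text fields: length-capped, then run through the jailbreak-phrase filter.
  intent: z.string().min(20).max(600),
  mission: z.string().max(600).optional(),
});
```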

None of these guardrails are flashy. All of them are the kind of thing that has to be there before a public AI demo can responsibly stay public — and the visible budget pill is the kind of trust signal hosted AI demos almost never bother with.

Composition with the prior two projects

The fallback ladder in src/agent/fallback-ladder.ts is ported directly from fedbench. Sonnet 4.6 primary, Haiku 4.5 fallback, with provenance returned on every call so the transcript can show which rung answered. The MCP-style tool registry mirrors the fieldops-mcp tool template — same { name, description, input_schema } shape so a reviewer who's read fieldops-mcp recognizes the pattern instantly. The recording layer (bun run demo reads the JSONL with no API key needed) is fedbench's pattern, transplanted.
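For reference, the ladder pattern looks roughly like this. Model identifiers and helper names are illustrative; this isn't the ported code verbatim.

```typescript
// Sketch of the fallback-ladder pattern: try each rung in order, return the
// answer plus which rung produced it so the transcript can show provenance.
interface LadderResult<T> {
  value: T;
  rung: string; // which model answered — surfaced in the transcript's provenance
}

async function callWithFallback<T>(
  rungs: string[],                     // e.g. ["sonnet", "haiku"]
  call: (model: string) => Promise<T>,
): Promise<LadderResult<T>> {
  let lastError: unknown;
  for (const model of rungs) {
    try {
      return { value: await call(model), rung: model };
    } catch (err) {
      lastError = err;                 // fall through to the next rung
    }
  }
  throw lastError;                     // every rung failed; the boundary wrapper turns this into a value
}
```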

Three repos, composed on purpose. Each one demonstrates one shape; reading all three shows a developer who builds in patterns rather than reinventing per project.

How these skills transfer

Federal grants are the example. The shape applies anywhere a buyer's workflow is bureaucratic, multi-step, and grounded in real systems of record. Real bottlenecks this shape addresses:

  • Bureaucratic workflows that buyers don't have time to navigate themselves. Tax filings, compliance audits, healthcare prior auths, B2B procurement, immigration, insurance claims. The planner-with-sub-agents shape compresses days of reading into a single transcript with a recommended next step — and shows the work, so the buyer can verify before acting.
  • Production agents going over budget. Every public AI feature is a cost surface the moment it ships. The daily-cap counter with graceful replay fallback is the pattern that turns "we shut the demo down at 11am because someone went viral" into "we serve a recorded run with a banner and the metric stays under cap." Generalizes to any per-team or per-tenant cost cap.
  • Prompt-injection on free-text customer fields. Anywhere a customer types into an agent — support chat, intake forms, "describe your situation" textareas — the free-text fields are the injection surface. Pushing everything you can into bounded enums, then length-capping and heuristically pre-filtering what stays free-text, neutralizes that surface without giving up the ability to take real input.
  • Fragile multi-step agents. The planner never throws — sub-agent failures, API errors, JSON parse failures all become structured TranscriptStep entries. That's what production agent code actually looks like once the demo gets real traffic. Errors are values you can route on; exceptions kill the request.
  • Enterprise compliance officers asking for hard gates. SAM.gov registration is a categorical disqualifier here; legal has equivalents in every regulated vertical. A deterministic check that overrides the LLM verdict is the difference between an agent that occasionally lies to compliance and an agent that compliance approves to ship.

These are the conversations a federal-grants demo opens that an abstract "I built an agent" pitch can't.

What the trilogy proves

fedbench proves I can measure whether an agent is right. fieldops-mcp proves I can shape what the agent can do at all. grant-pilot proves I can compose those into a working multi-turn workflow that fails honestly, falls back gracefully, and lives publicly without lighting money on fire.

That's the Forward Deployed shape. Three artifacts, public, MIT-licensed, with sample transcripts and live demos. Read the source. The proof is in the code.

Why I built this one, and what I'm hoping it starts

A lot of the work I want to be doing more of in the next few years sits at the boundary between a real customer's workflow and the agent that helps them run it. The names attached to that work are familiar — Forward Deployed Engineer, Applied AI, Solutions Engineer at an AI-native company, applied-AI practice work inside a consulting firm. The shared skill is the one I find most interesting in 2026: composing tools, sub-agents, and evals into a multi-turn system that does honest work for a real user. Picking when to fall back, when to refuse, when to surface uncertainty, where the human stays in the loop. Most of the value an agent delivers in production is decided in those choices, before it touches a model.

If you're working on agent-shaped problems — your own product, your own team, anywhere in your network, inside Deloitte's AI practice or at any of the AI-native companies building this kind of system — I'd genuinely enjoy a conversation. The version I find most useful is usually the smallest: one specific workflow, what you composed, what you almost shipped and pulled back. I'd rather swap notes on what's actually working than trade abstractions about agents in general.

Easiest way to reach me is the contact page on this site, or just connect on LinkedIn. The repo is at github.com/midimurphdesigns/grant-pilot, the live demo is at grant-pilot.kevinmurphywebdev.com, and the docs in there go deeper than this post does — an architecture overview, an ADR log, design notes, and a memory directory of the calls and quirks the system surfaced during build.