
Building grant-pilot: a multi-turn agent that orchestrates three sub-agents over real federal data

Third in the trilogy. fedbench measured agents. fieldops-mcp shaped what they could do. grant-pilot is what happens when an agent actually has to run a workflow end-to-end — discovery, eligibility, drafter, all dispatched by a planner over public grants.gov + SAM.gov data. Here's what I built and what shipping it taught me.

Repo: github.com/midimurphdesigns/grant-pilot

Live demo: grant-pilot.kevinmurphywebdev.com

Part of a three-project trilogy demonstrating the FDE / Applied AI shape — see also fedbench (eval rigor) and fieldops-mcp (agent tool design).

grant-pilot is the third project in a trilogy with fedbench and fieldops-mcp. fedbench is about measuring whether an agent is right. fieldops-mcp is about shaping what the agent can do at all. grant-pilot is what those look like composed: an agent that has to actually run a workflow, multi-turn, with real APIs, with structured failures, with a hosted demo strangers can run without trusting me with credentials.

What it does

A small business or nonprofit picks one of five intents (e.g. "I run a 12-person construction firm in Arizona, what infrastructure-related grants might fit?"). The agent does three things a grants consultant does:

  1. Discovery. Derives a keyword query from the intent, searches grants.gov, and ranks five candidates 0–100 with a one-line rationale each.
  2. Eligibility. For the top 3 candidates, fetches the full grant record, optionally checks SAM.gov registration status, and returns pass / fail / uncertain with reasons grounded in the eligibility text.
  3. Drafter. For the highest-ranked candidate that passed (or the highest "uncertain" if none passed cleanly), produces a structured application skeleton — section headings, per-section guidance, and prompts for the applicant to answer. Plus a watch-outs list of pitfalls grounded in the grant's eligibility text.

Three sub-agents, dispatched by a planner, composed into a transcript with provenance — which model answered each turn, how long it took, what it cost. Total cost per run: about five cents. Total wall time: about a minute.
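For context, here's roughly what a provenance-carrying transcript step might look like. This is a minimal sketch in TypeScript; the field names are illustrative, not the repo's actual types.

```typescript
// Illustrative sketch only — field names are assumptions, not the repo's actual types.
type SubAgent = "discovery" | "eligibility" | "drafter";

interface Provenance {
  model: string;     // which model actually answered (primary or fallback rung)
  latencyMs: number; // wall time for the turn
  costUsd: number;   // token cost attributed to the turn
}

interface TranscriptStep {
  agent: SubAgent;
  input: unknown;    // what the planner handed the sub-agent
  output: unknown;   // structured result, or a structured error (more on that below)
  provenance: Provenance;
}

// A full run is just the ordered list of steps the planner dispatched.
type Transcript = TranscriptStep[];
```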

Why three sub-agents, not five

The temptation is to add a "budget builder", a "compliance reviewer", a "prior-art search". Each of those is plausible. None of them is necessary to demonstrate the shape — sub-agent orchestration plus tool selection plus failure recovery. Adding them would be framework creep that dilutes the headline.

Three is the smallest count that proves the shape. Discovery uses one tool (search). Eligibility uses two (detail-fetch + SAM lookup) and exercises a hard-gate hoist where a deterministic SAM check overrides the model's verdict. Drafter uses one (detail-fetch) and exercises the most opinionated design choice in the project: it doesn't write prose.

The drafter doesn't write prose

This is the call I expect to get the most pushback on. "AI writes your grant application" is an easier marketing line. It would also be a worse product.

Federal grant applications need verifiable claims and applicant-specific voice. The agent will hallucinate both. So the drafter emits a structured skeleton — sections, guidance per section, questions the applicant must answer — instead of pretending it can speak for the organization. From a real run for the Arizona construction firm:

# Project Approach
  ↳ Construction activities, sequence, Davis-Bacon compliance plan...
  - What construction activities will you perform, and in what sequence?
  - What codes, standards, or federal requirements (Davis-Bacon Act,
    Buy American, environmental review) govern this project?
  - How will you manage subcontractors, and what portion of the work
    will your 12-person team self-perform versus sub out?

watch-outs:
  ! For-profit construction firms must apply through a state/local lead
    applicant — confirm partnership before further investment.
  ! SAM.gov registration must be active at submission AND throughout
    the period of performance.
  ! Davis-Bacon prevailing wage compliance is mandatory on federally-
    funded construction.

That's the shape AI is actually good at — compressing the reading work, surfacing what a NOFO asks for, structuring the response. Honest about what it isn't doing.
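To make the "skeleton, not prose" contract concrete, here's a rough sketch of the drafter's output shape. The names are assumptions for illustration, not the project's actual schema.

```typescript
// Sketch of the drafter's output shape — names are illustrative, not the actual schema.
interface SkeletonSection {
  heading: string;            // e.g. "Project Approach"
  guidance: string;           // what the NOFO expects this section to cover
  applicantPrompts: string[]; // questions only the applicant can answer
}

interface DraftSkeleton {
  grantId: string;
  sections: SkeletonSection[];
  watchOuts: string[];        // pitfalls grounded in the grant's eligibility text
}
```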

The hard gate the model can't override

If the user profile contains a UEI, the eligibility sub-agent checks SAM.gov registration status. If the registration is anything other than active, that fact gets prepended to the verdict's blockers regardless of what the model said.

Why bypass the model? Because federal grants categorically do not award to unregistered entities. Letting the model "decide" creates a path where it answers pass despite a fatal disqualifier. The hoist makes the constraint structural rather than emergent. It's the only place in the system where a deterministic check overrides the verdict — and that's exactly the kind of thing that should bypass the LLM, not depend on it.
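A minimal sketch of what that hoist can look like, assuming a samStatus string from the SAM.gov lookup and a model-produced verdict; the names are illustrative, not the repo's code.

```typescript
// Sketch of the hard-gate hoist; names are illustrative.
interface EligibilityVerdict {
  status: "pass" | "fail" | "uncertain";
  blockers: string[];
  reasons: string[];
}

function applySamGate(verdict: EligibilityVerdict, samStatus: string | null): EligibilityVerdict {
  // No UEI in the profile → no lookup was made; the model's verdict stands.
  if (samStatus === null) return verdict;
  if (samStatus === "Active") return verdict;

  // Deterministic override: an inactive registration is a categorical disqualifier,
  // prepended to the blockers regardless of what the model concluded.
  return {
    ...verdict,
    status: "fail",
    blockers: [`SAM.gov registration is ${samStatus}, not Active`, ...verdict.blockers],
  };
}
```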

How prompt design and external APIs are coupled

The first version of the discovery prompt produced queries like:

"commercial construction infrastructure small contractor Arizona Phoenix metro"

Zero useful hits.

grants.gov's keyword index does strict AND-matching — every term has to appear in the opportunity title or synopsis. Long queries get zero matches. After tightening the prompt to emit 2–4 broad nouns and explicitly drop geography (handled at eligibility time, not in the keyword index), the same intent produced a real shortlist:

52  PWEAA2023      — FY 2025 EDA Public Works and Economic Adjustment Assistance
45  DHS-25-MT-047  — FEMA Building Resilient Infrastructure and Communities (BRIC)
38  GR-RDC-25-001  — RESTORE Act Direct Component
35  HE125426R5001  — Military-Connected Schools Construction
30  VA-GRANTS-...  — State Veterans Home Construction Grant Program

The lesson, again: prompt design and external-API behavior are coupled. You can't tune one without understanding the other. The "AI part" is not separable from the "API part".
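As a sketch of what that coupling looks like in practice: the prompt carries the API's constraint, and a deterministic guard clamps the query even if the model drifts. The instruction wording and the clampQuery helper are illustrative, not the repo's actual code.

```typescript
// Sketch: the discovery instruction encodes the grants.gov AND-matching constraint,
// assuming the model returns a JSON array of terms.
const QUERY_INSTRUCTION = `
Derive a grants.gov keyword query from the applicant's intent.
Return 2-4 broad nouns as a JSON array of strings.
Do NOT include geography (state, city, region) — eligibility handles that later.
grants.gov AND-matches every term, so extra words mean zero hits.
`;

// Even with the prompt fixed, clamp the query so a verbose model can't regress it.
function clampQuery(terms: string[]): string {
  return terms.slice(0, 4).join(" ");
}
```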

Why the planner never throws

Sub-agent failures, search-API errors, JSON parse failures — every one of them surfaces as a structured TranscriptStep entry the renderer and the recorder both consume. The planner has zero try/catch blocks at the call-site level. Errors are values, not exceptions.

This is what production-shape error handling looks like in agent code. The hosted demo doesn't have to wrap calls in try/catch. The recording layer doesn't have to handle partial JSON. The eval scorer can pattern-match on result.kind === "error" without inspecting types. Failures compose; exceptions don't.
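A minimal sketch of the pattern, with illustrative names: one boundary wrapper turns exceptions into values, and everything downstream routes on kind.

```typescript
// Sketch of the errors-as-values pattern; names are illustrative.
type SubAgentResult<T> =
  | { kind: "ok"; value: T }
  | { kind: "error"; stage: string; message: string };

async function runSubAgent<T>(stage: string, fn: () => Promise<T>): Promise<SubAgentResult<T>> {
  try {
    return { kind: "ok", value: await fn() };
  } catch (err) {
    // In this sketch the only try/catch lives at this boundary; the planner,
    // renderer, recorder, and eval scorer all pattern-match on `kind` instead.
    return { kind: "error", stage, message: err instanceof Error ? err.message : String(err) };
  }
}
```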

The hosted demo, the custom-intent path, and the $3/day cap

I want strangers to be able to run this agent without cloning anything, registering for keys, or trusting me with their credentials. So there's a hosted demo: pick a preset intent or write your own funding-need description, watch the transcript stream in real time, see the same provenance the local CLI shows.

But the moment a public website hits an LLM provider, it's a cost surface. So the demo is hardened in four layers:

  • Five preset intents — each verified and recorded, so the recording can serve as a fallback when the budget runs out.
  • Custom intent + structured custom profile. Visitors describe their own scenario (20–600 chars) and fill in their own profile — NAICS code, state, ZIP, employee count, annual revenue, ownership designations, entity type, years in operation. Every structured field is bounded by an enum or a regex or a number range (see the schema sketch after this list), so the injection surface is identical to the preset case. The two free-text fields (intent and an optional mission description) are length-capped and filtered for jailbreak phrases before any model call.
  • Per-IP rate limit. Five runs per hour via Upstash. Enough headroom for a curious visitor; not enough for script-driven abuse.
  • Daily budget cap with transparent live readout. $3/day, ~45 demo runs. The page shows the running total and color-codes the bar (green → yellow → red) so visitors know what state they're in. Preset intents fall back to the recorded run when the cap is hit; custom intents return a 503 with a banner explaining the cap.
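Here's the kind of bounded schema those structured fields go through, sketched with zod for illustration; the field names and bounds are assumptions, not the repo's actual schema.

```typescript
import { z } from "zod";

// Illustrative only — field names and bounds are assumptions, not the repo's schema.
const CustomProfile = z.object({
  naicsCode: z.string().regex(/^\d{6}$/),                   // regex-bounded
  state: z.enum(["AZ", "CA", "TX" /* ...remaining states */]), // enum-bounded
  zip: z.string().regex(/^\d{5}$/),
  employeeCount: z.number().int().min(1).max(10_000),        // range-bounded
  annualRevenue: z.number().min(0),
  yearsInOperation: z.number().int().min(0).max(200),
  ownershipDesignations: z.array(z.enum(["veteran-owned", "woman-owned", "minority-owned"])),
  entityType: z.enum(["for-profit", "nonprofit", "tribal", "government"]),
  // The only free-text fields: length-capped, then run through the jailbreak-phrase filter.
  intent: z.string().min(20).max(600),
  mission: z.string().max(600).optional(),
});
```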

None of these guardrails are flashy. All of them are the kind of thing that has to be there before a public AI demo can responsibly stay public — and the visible budget pill is the kind of trust signal hosted AI demos almost never bother with.

Composition with the prior two projects

The fallback ladder in src/agent/fallback-ladder.ts is ported directly from fedbench. Sonnet 4.6 primary, Haiku 4.5 fallback, with provenance returned on every call so the transcript can show which rung answered. The MCP-style tool registry mirrors the fieldops-mcp tool template — same { name, description, input_schema } shape so a reviewer who's read fieldops-mcp recognizes the pattern instantly. The recording layer (bun run demo reads the JSONL with no API key needed) is fedbench's pattern, transplanted.
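For reference, the ladder pattern looks roughly like this. Model identifiers and helper names are illustrative; this isn't the ported code verbatim.

```typescript
// Sketch of the fallback-ladder pattern: try each rung in order, return the
// answer plus which rung produced it so the transcript can show provenance.
interface LadderResult<T> {
  value: T;
  rung: string; // which model answered — surfaced in the transcript's provenance
}

async function callWithFallback<T>(
  rungs: string[],                     // e.g. ["sonnet", "haiku"]
  call: (model: string) => Promise<T>,
): Promise<LadderResult<T>> {
  let lastError: unknown;
  for (const model of rungs) {
    try {
      return { value: await call(model), rung: model };
    } catch (err) {
      lastError = err;                 // fall through to the next rung
    }
  }
  throw lastError;                     // every rung failed; the boundary wrapper turns this into a value
}
```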

Three repos, composed on purpose. Each one demonstrates one shape; reading all three shows a developer who builds in patterns rather than reinventing per project.

How these skills transfer

Federal grants are the example. The shape applies anywhere a buyer's workflow is bureaucratic, multi-step, and grounded in real systems of record. Real bottlenecks this shape addresses:

  • Bureaucratic workflows that buyers don't have time to navigate themselves. Tax filings, compliance audits, healthcare prior auths, B2B procurement, immigration, insurance claims. The planner-with-sub-agents shape compresses days of reading into a single transcript with a recommended next step — and shows the work, so the buyer can verify before acting.
  • Production agents going over budget. Every public AI feature is a cost surface the moment it ships. The daily-cap counter with graceful replay fallback is the pattern that turns "we shut the demo down at 11am because someone went viral" into "we serve a recorded run with a banner and the metric stays under cap." Generalizes to any per-team or per-tenant cost cap.
  • Prompt-injection on free-text customer fields. Anywhere a customer types into an agent — support chat, intake forms, "describe your situation" textareas — the free-text fields are the injection surface. Pushing everything you can into bounded enums, then length-capping and heuristically pre-filtering what stays free-text, neutralizes that surface without giving up the ability to take real input.
  • Fragile multi-step agents. The planner never throws — sub-agent failures, API errors, JSON parse failures all become structured TranscriptStep entries. That's what production agent code actually looks like once the demo gets real traffic. Errors are values you can route on; exceptions kill the request.
  • Enterprise compliance officers asking for hard gates. SAM.gov registration is a categorical disqualifier here; legal has equivalents in every regulated vertical. A deterministic check that overrides the LLM verdict is the difference between an agent that occasionally lies to compliance and an agent that compliance approves to ship.

These are the conversations a federal-grants demo opens that an abstract "I built an agent" pitch can't.

What the trilogy proves

fedbench proves I can measure whether an agent is right. fieldops-mcp proves I can shape what the agent can do at all. grant-pilot proves I can compose those into a working multi-turn workflow that fails honestly, falls back gracefully, and lives publicly without lighting money on fire.

That's the Forward Deployed shape. Three artifacts, public, MIT-licensed, with sample transcripts and live demos. Read the source. The proof is in the code.

Why I built this one, and what I'm hoping it starts

A lot of the work I want to be doing more of in the next few years sits at the boundary between a real customer's workflow and the agent that helps them run it. The names attached to that work are familiar — Forward Deployed Engineer, Applied AI, Solutions Engineer at an AI-native company, applied-AI practice work inside a consulting firm. The shared skill is the one I find most interesting in 2026: composing tools, sub-agents, and evals into a multi-turn system that does honest work for a real user. Picking when to fall back, when to refuse, when to surface uncertainty, where the human stays in the loop. Most of the value an agent delivers in production is decided in those choices, before it touches a model.

If you're working on agent-shaped problems — your own product, your own team, anywhere in your network, inside Deloitte's AI practice or at any of the AI-native companies building this kind of system — I'd genuinely enjoy a conversation. The version I find most useful is usually the smallest: one specific workflow, what you composed, what you almost shipped and pulled back. I'd rather swap notes on what's actually working than trade abstractions about agents in general.

Easiest way to reach me is the contact page on this site, or just connect on LinkedIn. The repo is at github.com/midimurphdesigns/grant-pilot, the live demo is at grant-pilot.kevinmurphywebdev.com, and the docs in there go deeper than this post does — an architecture overview, an ADR log, design notes, and a memory directory of the calls and quirks the system surfaced during build.