May 22, 2026 | 20 min | Engineering

Building Forge: a multi-agent debugging concierge in production-grade TypeScript

Point Forge at a stack trace. Four specialist subagents fan out in parallel, each with a focused tool set, and a coordinator merges their structured findings into ranked hypotheses with calibration-weighted confidence. Built on the Vercel AI SDK plus Anthropic. Here's what I built, what I learned about parallel agent orchestration, and what I would change for production.

Repo: github.com/midimurphdesigns/forge

Live: forge.kevinmurphywebdev.com

Skills exemplified

For the hiring manager skimming this post: every bullet below maps to a layer in Forge that the live demo shows in action.

Multi-agent orchestration with the Vercel AI SDK. Parallel fan-out via Promise.all plus pLimit bounded concurrency, four specialist subagents with focused tool sets, structured discriminated-union outcomes.
Two-pass agent pattern for reliable structured output from a tool-using loop (generateText then generateObject).
Durable session state across serverless replicas. UUID in URL, Upstash-backed store, work-vs-transport separation so refresh resumes from a snapshot without re-running the agents.
Preemptive abort plumbed end-to-end via AbortSignal. UI click flips an Upstash flag, the coordinator polls cross-instance, an in-process AbortController cancels the in-flight Anthropic fetch.
Brier-score calibration as a feedback loop. Every (predicted confidence, rubric outcome) pair gets logged, weights derive from mean outcome divided by mean predicted with a clamp, applied at merge time.
Anthropic prompt caching with cache-control breakpoints on system messages, instrumented with per-call USD pricing and cache hit rate in the cost dashboard.
Graded eval rubric with intersection-over-union scoring for line ranges, n-runs aggregation, and per-scenario mean plus standard deviation for statistical significance.
Speculative tool-input prefetch. Predict the next tool call based on the current one and fire it in parallel with the LLM's thinking time.
Production-shaped guardrails. Upstash sliding-window rate limit per IP, daily USD cap shared across all visitors, owner-bypass cookie.
TypeScript discipline. Strict mode, zero any, discriminated unions driving exhaustive UI rendering, typed cross-process state contracts.

Forge is a multi-agent debugging concierge. Paste a stack trace. Four specialist subagents run in parallel, each with a focused tool set and a tight system prompt. A coordinator merges their structured findings into ranked hypotheses, weights confidence by per-lane Brier calibration, and streams progress events to the browser as the work unfolds. Every refresh resumes the in-flight session. Every stop button cancels the underlying LLM call mid-stream.

I built Forge to learn three things in anger: parallel agent orchestration with the Vercel AI SDK, durable session state across serverless replicas, and the gnarly real questions that come up when you try to make an agent system trustworthy at scale.

This post is the architectural walkthrough, the honest tradeoffs, and the list of what I would build differently if a real production customer adopted this tomorrow.

What is on screen

Open the live URL and you see one page. A short visitor header explains what is going on. Below that, a single button kicks off an investigation against a hardcoded sample input (a TypeError stack trace plus a deployed commit SHA plus an error timestamp plus a fingerprint, visible behind a disclosure toggle).

Click run. Four lane cards transition from queued to running to done, in parallel, in roughly three to eight seconds each. As lanes finish, the coordinator merges results into ranked hypotheses with confidence percentages. A calibration panel shows the per-lane Brier scores and the weights derived from them. A cost panel breaks down input tokens, output tokens, cache reads, and USD per lane. An eval harness section renders a representative snapshot from the CLI runner.

That is the surface. The interesting part is what runs underneath.

The four specialists

Forge has one design rule that pays for itself across the whole codebase: specialize, do not generalize. A naive agent gets a stack trace, the full repo, the full error log, and a giant prompt that says "figure it out." Real-world agents fail at this because the model has to context-switch between four reasoning modes, the prompt gets bloated with everything it might need, and the output is one monolithic answer that is hard to grade.

Forge splits the work across four subagents, each with one job, one focused tool set, and a structured output contract.

source-reader identifies the implicated source files from the stack trace and reads the code at the relevant lines. Its tools are fetch_file and fetch_directory. Its output is a typed object with file path, line range, snippet, surrounding context, confidence, and reasoning.

blame-correlator finds recent commits that could have caused the error and ranks them by relevance. Its tools are git_log, git_diff, and git_blame. Its output is a list of candidate commits with relevance scores plus a single top suspect plus an aggregate confidence.

frequency-analyzer quantifies the blast radius. How often does this error fire, how many users hit it, is it spiking. Its tools are error-tracking queries. Its output is a structured severity report with a p0 through p3 classification.

repro-drafter writes a minimal local reproduction from the stack trace alone. Its tool set is empty by design. Its output is numbered steps, optional code, environment requirements, and an honest list of gaps in its assumptions.

The design contract for these lives in docs/AGENTS.md on the repo. I wrote that doc before writing any of the agent code. The point of writing it first was to be able to re-explain the system to an interviewer six months later by reading the doc, closing it, and reconstructing the architecture from words.

Why parallel beats sequential

The naive way to build this is a sequential agent loop. source-reader runs, its output flows into blame-correlator, which flows into frequency-analyzer, which flows into repro-drafter. That structure has two real problems.

The first is latency. Four 3-second calls run sequentially take about 12 seconds. Run in parallel, they take about 3 seconds, because the slowest lane sets the total. User-facing latency is the headline win.

The second is more subtle and more important. Sequential reasoning means each agent's hypothesis contaminates the next agent's framing. If source-reader concludes "this is a null deref in auth.ts," blame-correlator anchors on that and stops considering "maybe the stack trace is misleading because of a sourcemap mismatch." Parallel reasoning lets each lane form an independent hypothesis. The coordinator then merges with a calibration-aware weighting. This is the same independence-of-evidence trick that ensemble methods in classical ML use.

Forge implements this with a single fan-out in lib/coordinator.ts:

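import pLimit from "p-limit";

// Cap how many lane tasks run at once within a single request.
const limit = pLimit(4);

const outcomes = await Promise.all(
  LANES.map((lane) =>
    limit(async (): Promise<LaneOutcome> => {
      // Timing and reason derivation are elided in the excerpt; this is one plausible way to fill them in.
      const startedAt = Date.now();
      try {
        const value = await lane.run(input, sessionId, controller.signal);
        return { lane: lane.name, status: "fulfilled", value, durationMs: Date.now() - startedAt };
      } catch (err) {
        // Failure comes back as data, so the outer Promise.all never rejects.
        const reason = err instanceof Error ? err.message : String(err);
        return { lane: lane.name, status: "rejected", reason, durationMs: Date.now() - startedAt };
      }
    }),
  ),
);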

Two small things in that snippet are doing real work. First, pLimit(4) is a bounded-concurrency primitive. It caps how many promises run at once within a single request. With four lanes and a limit of four, nothing ever queues, so the cap is a no-op here; the value is in having the pattern in place. In a production version with twenty lanes you would set pLimit to four or five, and the remaining lanes would queue, each starting as soon as a slot frees up.
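To make the queueing behavior concrete, here is a minimal sketch of what a p-limit-style limiter does under the hood. createLimiter and demo are hypothetical names written for illustration; this is not the p-limit source and not Forge's code.

```typescript
// Hypothetical stand-in for a p-limit-style limiter; illustration only.
function createLimiter(maxConcurrent: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= maxConcurrent) {
      // Park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // transfer the slot directly, so the cap is never exceeded
      else active--;
    }
  };
}

// Twenty fake lanes through a limit of four, tracking peak concurrency.
async function demo(): Promise<{ peak: number; results: number[] }> {
  const limit = createLimiter(4);
  let running = 0;
  let peak = 0;
  const results = await Promise.all(
    Array.from({ length: 20 }, (_, i) =>
      limit(async () => {
        running++;
        peak = Math.max(peak, running);
        await new Promise((resolve) => setTimeout(resolve, 5));
        running--;
        return i;
      }),
    ),
  );
  return { peak, results };
}
```

The slot is handed from a finishing task straight to the oldest waiter rather than being released and re-acquired, which is what keeps a newly arriving task from sneaking past the cap.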

Second, every lane catches its own errors and returns a typed LaneOutcome discriminated union. That is why I use Promise.all instead of Promise.allSettled. The outer promise never rejects because the inner functions never throw. The coordinator gets back a typed array of outcomes, each one either fulfilled with a result or rejected with a reason. No mixed-rejection batch semantics, no try-catch wrapping the merge logic. The interview-shaped articulation is "each lane already returns success or failure as data, so the outer promise never sees a rejection."
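A sketch of what that discriminated union could look like, with an exhaustive switch on the consumer side. The field names are inferred from the snippet above; the LaneName type and describeOutcome are assumptions, and the real definitions in the repo may differ.

```typescript
// Assumed shapes, inferred from the coordinator excerpt; not copied from the repo.
type LaneName =
  | "source-reader"
  | "blame-correlator"
  | "frequency-analyzer"
  | "repro-drafter";

type LaneOutcome<T = unknown> =
  | { lane: LaneName; status: "fulfilled"; value: T; durationMs: number }
  | { lane: LaneName; status: "rejected"; reason: string; durationMs: number };

function describeOutcome(outcome: LaneOutcome): string {
  switch (outcome.status) {
    case "fulfilled":
      return `${outcome.lane} finished in ${outcome.durationMs}ms`;
    case "rejected":
      return `${outcome.lane} failed: ${outcome.reason}`;
    default: {
      // Exhaustiveness check: adding a new status makes this assignment a compile error.
      const unreachable: never = outcome;
      return unreachable;
    }
  }
}
```

Because status is the discriminant, TypeScript narrows each branch, so value is only reachable on the fulfilled arm and reason only on the rejected arm.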

The two-pass agent pattern

Every LLM call has one job that defines its output shape. Either it loops with tools and produces unstructured text, or it produces a typed JSON object with no tools. Trying both in one call fails in one of two ways. The model either skips tools and hallucinates the answer, or calls tools correctly and produces schema-invalid JSON. The Vercel AI SDK enforces the split at the API level: generateObject does not accept a tools argument.

Forge's subagents use a two-pass pattern. Pass one is generateText with the tool set and a step-count stop condition, producing a free-form investigation transcript. Pass two is generateObject with no tools and a Zod schema, coercing the transcript into the typed result shape.

const investigation = await generateText({
  model: anthropic(MODEL),
  messages: [...],
  tools: { fetch_file, fetch_directory },
  stopWhen: stepCountIs(6),
  abortSignal: signal,
});

const { object } = await generateObject({
  model: anthropic(MODEL),
  schema: Schema,
  prompt: `Investigation transcript:\n\n${investigation.text}\n\nProduce the structured result.`,
  abortSignal: signal,
});

Two SDK calls per lane across four lanes comes to eight calls per investigation, and pass one can make up to six model round trips of its own when the lane uses tools. The cost dashboard in the live demo shows roughly $0.08 per full run on Claude Sonnet 4.6. That is the price of reliable structured output from a tool-using loop. The alternative, fighting the model to produce both at once, fails too often to be useful in production.

Resumable streams

Forge's investigation is a first-class durable session. Every state transition writes to a session store keyed by a UUID. The browser captures the UUID from the first SSE frame and pins it to the URL via history.replaceState. Refresh the tab and the page reads ?sessionId=X, fires a GET to /api/debug?sessionId=X, and the server replays the buffered lane state plus the merged hypotheses.

The trick worth knowing here is what crosses the refresh boundary and what does not. Three things cross.

The UUID in the URL. Persisted in the browser's address bar across reloads.

The session state on the server. A map of lane statuses, results, durations, plus the merged hypotheses, plus the cost summary. Lives in process memory in dev. Should live in Upstash or KV in production.

The work itself. The original POST's runCoordinator keeps running regardless of whether the client is connected. Two browsers can resume the same UUID simultaneously and both see the same snapshot. They do not interfere with each other because the GET handler is read-only.

What does NOT cross is the live tail. The resume GET returns a snapshot of the state at the moment of the GET, not a subscription to future updates. If you want to watch the still-running lanes continue, you need to refresh again, or build an event bus the POST publishes to and the GET subscribes to. I deferred that layer. Snapshot-on-refresh is enough for the demo's use case and the architectural pattern is what matters.

The architectural shape to internalize is the separation of work and transport. The work (the agent loop) runs to completion on the server regardless of whether anyone is listening. The transport (the HTTP response stream) is just one possible subscriber to the work's progress. Resume reconnects the transport to a different snapshot of the same work, not to the work itself.
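The separation can be sketched with a tiny in-memory store: the work writes through update, and the transport only ever reads a copy through snapshot. All names here are hypothetical, and the repo's SessionStore interface may be shaped differently.

```typescript
// Hypothetical names throughout; a sketch of the snapshot-on-resume idea only.
type LaneStatus = "queued" | "running" | "done" | "aborted";

interface SessionState {
  lanes: Record<string, LaneStatus>;
  hypotheses: string[];
}

class InMemorySessionStore {
  private sessions = new Map<string, SessionState>();

  // Called by the work (the coordinator) on every state transition.
  update(sessionId: string, mutate: (state: SessionState) => void): void {
    const state = this.sessions.get(sessionId) ?? { lanes: {}, hypotheses: [] };
    mutate(state);
    this.sessions.set(sessionId, state);
  }

  // Called by the transport (the resume GET). Returns a copy, so a reader
  // cannot mutate the work's state and does not observe later updates.
  snapshot(sessionId: string): SessionState | null {
    const state = this.sessions.get(sessionId);
    if (!state) return null;
    return { lanes: { ...state.lanes }, hypotheses: [...state.hypotheses] };
  }
}
```

Two readers calling snapshot at different moments get different copies of the same work, which is the behavior described above for two browsers resuming one UUID.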

Per-lane interrupt and the signal-vs-actuator distinction

The stop button on each lane card was the bug that taught me the most about serverless.

Version one was cooperative abort. The interrupt POST set an in-memory flag, and the coordinator checked the flag at lane-task boundaries (before start, after lane.run returns). The flag worked perfectly on localhost. The same fan-out runs in one Node process; the interrupt POST and the coordinator share heap memory; the flag flip is visible everywhere.

On Vercel it failed silently. The investigation kept running for 30 seconds after I clicked stop, then completed normally. Vercel's serverless runtime can route consecutive requests from the same browser to different replicas. The POST that started the investigation landed on instance A; the interrupt POST landed on instance B. Instance B's in-memory store was empty (different process, different heap), the session lookup returned null, the interrupt returned 404 before flipping anything, and instance A's coordinator never knew it was supposed to stop.

The fix is the lesson worth repeating. The signal is global; the actuator is local.

The interrupt POST writes the abort signal to Upstash. The coordinator (running on a different replica, possibly) polls Upstash every 500ms and, when it sees the flag set, calls controller.abort() on its own local AbortController. The signal crosses processes via Redis. The actuator (the AbortController, with its signal reference and abort() method) is a JavaScript object that lives in heap memory in exactly one process and can only cancel work running in that same process. You cannot serialize a controller, you cannot ship it to another instance. So you signal across processes and actuate within them.
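Here is a minimal sketch of that poll-then-actuate loop. readAbortFlag is a hypothetical function standing in for the Upstash read, and watchAbortFlag is an illustrative name, not the one used in the repo.

```typescript
// readAbortFlag stands in for a shared-store read (for example Redis); hypothetical.
type ReadAbortFlag = (sessionId: string) => Promise<boolean>;

function watchAbortFlag(
  sessionId: string,
  readAbortFlag: ReadAbortFlag,
  controller: AbortController,
  intervalMs = 500,
): () => void {
  const timer = setInterval(async () => {
    try {
      if (await readAbortFlag(sessionId)) {
        controller.abort(); // local actuator: cancels work in this process only
        clearInterval(timer);
      }
    } catch {
      // A failed poll is ignored in this sketch; the next tick retries.
    }
  }, intervalMs);

  // The caller stops the watcher once the investigation finishes.
  return () => clearInterval(timer);
}
```

The watcher never needs to know which replica received the interrupt POST. It only needs read access to the shared flag and a reference to its own controller.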

Then I plumbed AbortSignal end-to-end. Every subagent function accepts an optional signal and passes it through to both the generateText and generateObject calls via the AI SDK's abortSignal parameter. When the controller aborts, the underlying fetch to Anthropic closes its connection, the SDK promise rejects with AbortError, the coordinator catches it, distinguishes abort-shaped errors from real errors, and marks the lane aborted.
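Distinguishing abort-shaped errors from real ones can be as small as the sketch below. It assumes the rejection carries the name "AbortError", as described above; isAbortError and classifyFailure are hypothetical helpers, not the repo's.

```typescript
// Hypothetical helpers; assumes aborted calls reject with an error named "AbortError".
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    "name" in err &&
    (err as { name: unknown }).name === "AbortError"
  );
}

// If the local signal has fired, treat the failure as an abort even when the
// error surfaced in some other shape. That choice is an assumption of this sketch.
function classifyFailure(err: unknown, signal?: AbortSignal): "aborted" | "rejected" {
  return signal?.aborted || isAbortError(err) ? "aborted" : "rejected";
}
```

Checking the name property rather than using instanceof keeps the helper working whether the runtime throws a DOMException or a plain Error subclass.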

Round-trip from click to "aborted" badge is roughly one second on Vercel and effectively instant on localhost. The UI flips to an optimistic "stopping" state on click so the button feels responsive before the server-side abort lands.

What this layer actually demonstrates is that real cancellation in a distributed agent system requires three separate things working together. First, a shared signal: any process can read or write the abort intent, which is why it lives in Upstash. Second, a local actuator: each running process owns an AbortController that can cancel work happening inside it but nowhere else. Third, signal-awareness all the way down: every layer of the call stack, including the HTTP request the AI SDK fires to Anthropic, has to accept and forward the signal so the cancellation reaches the actual work. Miss any one of these and the stop button is decorative.

Brier calibration and self-correction

LLMs are systematically overconfident. Source-reader will tell you it is 95% sure the bug is in src/auth/session.ts even when it is wrong. If the merge logic naively averages those overconfident self-ratings, the system's overall confidence is also overconfident, and a hiring manager rightly stops trusting the output.

Forge's calibration layer measures and corrects.

Every (predicted confidence, rubric outcome) pair is logged to Upstash after each session. The rubric scores each lane's structured output against a per-component grading scheme. For source-reader it is 40 points for file match, 20 points scaled by line-range intersection-over-union with the ground truth, 15 points for having a snippet, 25 points for having reasoning. The threshold for "this lane counted as correct" is total >= 60% of max. That threshold maps to "useful answer, not perfect answer," which is what a real reviewer applies when deciding whether to act on a hypothesis.
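The source-reader rubric above can be sketched as follows. The point values and the 60% threshold come from the description; the type names, the inclusive line-range convention, and the choice to award IoU credit only when the file matches are assumptions.

```typescript
// Inclusive line numbers; the convention is an assumption of this sketch.
interface LineRange { start: number; end: number }

function lineRangeIoU(a: LineRange, b: LineRange): number {
  const overlap = Math.min(a.end, b.end) - Math.max(a.start, b.start) + 1;
  const intersection = Math.max(0, overlap);
  const union = (a.end - a.start + 1) + (b.end - b.start + 1) - intersection;
  return union > 0 ? intersection / union : 0;
}

interface SourceReaderResult {
  filePath: string;
  lines: LineRange;
  snippet: string;
  reasoning: string;
}

interface GroundTruth { filePath: string; lines: LineRange }

const PASS_THRESHOLD = 60; // 60% of the 100-point maximum

function scoreSourceReader(
  result: SourceReaderResult,
  truth: GroundTruth,
): { total: number; correct: boolean } {
  const fileMatch = result.filePath === truth.filePath;
  let total = 0;
  if (fileMatch) total += 40;                                              // file match
  if (fileMatch) total += 20 * lineRangeIoU(result.lines, truth.lines);    // scaled by IoU
  if (result.snippet.trim().length > 0) total += 15;                       // has a snippet
  if (result.reasoning.trim().length > 0) total += 25;                     // has reasoning
  return { total, correct: total >= PASS_THRESHOLD };
}
```

For example, predicted lines 10 to 20 against true lines 15 to 25 overlap on 6 lines out of a union of 16, so the IoU is 0.375 and the line-range component earns 7.5 of its 20 points.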

Across many runs, two numbers per lane are computed.

Brier score is the mean squared error between predicted confidence and binary outcome. Lower is better. Zero means every prediction was fully confident and right, so it is a floor rather than a realistic target. 0.25 is the no-information baseline (the score you get from predicting 0.5 on everything). Above 0.25 is worse than that baseline and means the lane's stated confidence is actively misleading.
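As a quick illustration of that definition, a Brier score over logged pairs is a few lines. CalibrationSample is a hypothetical name for the (predicted confidence, rubric outcome) pair.

```typescript
// Hypothetical shape for one logged (predicted confidence, rubric outcome) pair.
interface CalibrationSample {
  predicted: number; // self-reported confidence in [0, 1]
  outcome: 0 | 1;    // 1 if the rubric counted the lane as correct
}

function brierScore(samples: CalibrationSample[]): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sumSquaredError = samples.reduce(
    (acc, s) => acc + (s.predicted - s.outcome) ** 2,
    0,
  );
  return sumSquaredError / samples.length;
}
```

Predicting 0.5 on everything gives a squared error of 0.25 per sample regardless of outcome, which is where the baseline comes from. A single 95%-confident miss costs 0.9025 on its own.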

Weight is the ratio of mean outcome to mean predicted, clamped to the range 0.5 to 1.5 with a three-sample floor before weighting activates. A chronically overconfident lane sees its weight drop toward 0.5. A chronically underconfident lane sees its weight rise toward 1.5. The clamp prevents small-sample overcommitment in either direction.

At merge time, each lane's contributed confidence is multiplied by its weight before ranking. A lane with weight 0.6 has 40% less influence on the merged hypothesis than a lane with weight 1.0. The system gets more reliable hypotheses over time even though no individual lane improves.
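The weight rule and its use at merge time can be sketched like this. The clamp range and three-sample floor come from the description above; returning a neutral weight of 1 below the floor, the zero-division guard, and capping the weighted confidence at 1 are assumptions of this sketch.

```typescript
interface CalibrationSample {
  predicted: number; // self-reported confidence in [0, 1]
  outcome: 0 | 1;    // 1 if the rubric counted the lane as correct
}

const MIN_SAMPLES = 3;
const MIN_WEIGHT = 0.5;
const MAX_WEIGHT = 1.5;

function laneWeight(samples: CalibrationSample[]): number {
  // Assumption: below the sample floor the lane keeps a neutral weight.
  if (samples.length < MIN_SAMPLES) return 1;
  const meanPredicted = samples.reduce((a, s) => a + s.predicted, 0) / samples.length;
  const meanOutcome = samples.reduce((a, s) => a + s.outcome, 0) / samples.length;
  if (meanPredicted === 0) return 1; // assumption: avoid dividing by zero
  const ratio = meanOutcome / meanPredicted;
  return Math.min(MAX_WEIGHT, Math.max(MIN_WEIGHT, ratio));
}

// Applied at merge time, before ranking. Capping at 1 is an assumption.
function weightedConfidence(rawConfidence: number, weight: number): number {
  return Math.min(1, rawConfidence * weight);
}
```

A lane that claims 0.9 on average but is right half the time lands at a weight of about 0.56, so its 0.9 contributes roughly 0.5 to the merge.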

The principle I want to lock in for interviews is that calibration is system-level learning, not model-level learning. The lanes are stateless function calls. The calibration log is the system's accumulated memory of which lanes to trust more or less. The lanes never get smarter. The merge gets smarter. That separation is the architectural win.

The honest gap I documented in the UI: real production calibration needs a real correctness oracle. Forge's rubric is a stub that knows the right answer for the demo's sample input. For arbitrary scenarios you need either labeled eval data (the harness's golden set) or implicit signals like user thumbs-up or bug-reopened-within-24-hours. Neither is free. The blog post is the place to be honest about this; the demo page now says it in plain text.

Speculative tool-input prefetch

There is one more orchestration layer worth describing. The speculator pre-fires likely next tool inputs in the background while the LLM is still reasoning about the current tool's output.

When source-reader calls fetch_directory("src/auth"), a rule fires that says "the next call is almost certainly fetch_file("src/auth/session.ts") against the same SHA." The speculator immediately calls the file fetcher and stores the resulting promise in a cache. By the time the LLM actually requests the file, the result is sitting in the cache and the consume call returns instantly. On a real GitHub API where each call takes 800ms, this saves real latency.
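A minimal sketch of that promise cache, assuming a string key built from tool name, SHA, and path. The Speculator class and its method names are hypothetical, and the real speculator may track more state (such as the in-flight count).

```typescript
// Hypothetical sketch of a speculative prefetch cache; not the repo's implementation.
class Speculator<T> {
  private cache = new Map<string, Promise<T>>();
  stats = { predictions: 0, hits: 0, misses: 0 };

  // Fire the predicted call now and keep the in-flight promise.
  prefetch(key: string, fetcher: () => Promise<T>): void {
    if (this.cache.has(key)) return;
    this.stats.predictions++;
    const pending = fetcher();
    // Drop failed speculations so a later real call can retry cleanly.
    pending.catch(() => this.cache.delete(key));
    this.cache.set(key, pending);
  }

  // Called when the LLM actually requests the tool result.
  consume(key: string, fetcher: () => Promise<T>): Promise<T> {
    const cached = this.cache.get(key);
    if (cached) {
      this.stats.hits++;
      this.cache.delete(key);
      return cached; // may already be resolved, or still in flight
    }
    this.stats.misses++;
    return fetcher();
  }
}
```

Storing the promise rather than the resolved value matters: if the LLM asks for the file while the prefetch is still in flight, consume joins the existing request instead of starting a second one.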

Tracked metrics on the live demo show predictions, hits, misses, and in-flight count. The hit rate is the headline KPI. Below ~50% the speculator is wasting more calls than it saves.

This pattern only makes sense when three conditions hold. The next call is highly predictable given the current call. The tool calls have non-trivial latency. Wasted speculations are cheap. Forge satisfies all three for the directory-to-file navigation in source-reader. The blame correlator's git_log could speculate git_diff(top_returned_sha), but the prediction is shakier (the LLM might want a different commit), so I left it out.

The interview talking surface is "streaming-aware tool orchestration." Speculative execution borrowed from CPU pipelines, applied to LLM tool calls. The math that determines whether it is worth using is hit-rate times min(latency, thinking-time) minus waste-rate times cost. For Forge's mocked-fixture tools the actual production value is questionable. For a research agent calling expensive search APIs the pattern is a real win.
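That break-even expression is easy to turn into a function. The sketch below assumes the cost of a wasted call is expressed in the same unit as the saving (milliseconds-equivalent), which the formula above leaves unspecified; all names are hypothetical.

```typescript
// Hypothetical inputs; wastedCallCostMs is a cost converted to a time-equivalent.
interface SpeculationInputs {
  hitRate: number;          // fraction of predictions the LLM actually requests
  toolLatencyMs: number;    // how long the real tool call takes
  thinkingTimeMs: number;   // how long the LLM spends before issuing the next call
  wasteRate: number;        // fraction of predictions never consumed
  wastedCallCostMs: number; // cost of one wasted call, in the same unit
}

function expectedSavingPerPredictionMs(i: SpeculationInputs): number {
  // A hit can only hide as much latency as overlaps with thinking time.
  const savedOnHit = Math.min(i.toolLatencyMs, i.thinkingTimeMs);
  return i.hitRate * savedOnHit - i.wasteRate * i.wastedCallCostMs;
}
```

With an 80% hit rate, 800ms tool latency, 500ms of thinking time, and a wasted call valued at 100ms, the expected saving is 0.8 × 500 − 0.2 × 100 = 380ms per prediction. A negative result means the speculator should be switched off for that rule.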

The eval harness

A demo without measurement is just a vibe.

Forge ships with a CLI eval runner (scripts/eval.ts) that takes five reproducible bug scenarios from lib/eval/scenarios.ts and runs each one N times against the coordinator. Each lane's structured output is scored against the rubric in lib/eval/rubric.ts. The runner reports per-scenario mean and standard deviation in raw points, plus mean duration in milliseconds. Snapshots write to .forge/evals/<timestamp>.json for diffing against prior baselines.

The two ideas that matter here are graded rubric and n-runs aggregation.

Graded rubric. A binary "correct or not correct" verdict throws away too much information. Source-reader returning the right file with a slightly-wrong line range gets full file-match credit and proportional IoU credit. Under the source-reader weights, an IoU of 0.5 with a snippet and reasoning present scores 40 + 10 + 15 + 25 = 90 of 100 points. That is a useful signal. The Brier outcome is then derived from the rubric total by the 60% threshold rule, but the rubric total itself drives the human-facing scoreboard.

N-runs aggregation. LLM output is non-deterministic. A single "v2 got 17 out of 20 right" run is a sample from a distribution, not proof of improvement. The runner does N runs per scenario and reports mean plus standard deviation. v2 is only a real improvement over v1 if v2's mean exceeds v1's mean by more than roughly two standard deviations. Without this discipline every prompt change is a vibes-based judgment.
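A sketch of the aggregation and the two-standard-deviation rule of thumb. The rule above does not say which version's spread to compare against, so this sketch uses the larger of the two; that choice and the function names are assumptions. A stricter comparison would use a proper significance test on the difference of means.

```typescript
function mean(xs: number[]): number {
  if (xs.length === 0) throw new Error("need at least one run");
  return xs.reduce((a, x) => a + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) throw new Error("need at least two runs");
  const m = mean(xs);
  const sumSquares = xs.reduce((a, x) => a + (x - m) ** 2, 0);
  return Math.sqrt(sumSquares / (xs.length - 1));
}

// Rule of thumb from the text: the gain must exceed roughly two standard deviations.
// Assumption: compare against the noisier of the two versions.
function isRealImprovement(v1Scores: number[], v2Scores: number[]): boolean {
  const spread = Math.max(sampleStdDev(v1Scores), sampleStdDev(v2Scores));
  return mean(v2Scores) - mean(v1Scores) > 2 * spread;
}
```

For rubric totals of [70, 72, 68, 70] against [71, 73, 69, 71], the means differ by 1 point while each set has a standard deviation of about 1.63, so the change is indistinguishable from run-to-run noise.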

The live demo page renders a labeled illustrative snapshot rather than firing a fresh server-side eval on every visit. That tradeoff is the senior-engineer move: running for every visitor would cost roughly $1.50 per click and trip the daily cap immediately. Serving a stale snapshot without labeling would lie about freshness. Illustrative-and-labeled, with a pointer to the CLI runner and the snapshot JSON path, is the honest middle.

Prompt caching and the honesty rule

Every subagent's system message carries an Anthropic ephemeral cache breakpoint. The intention is for the static system prompt to be cached at Anthropic's edge and reused at 10% of the normal input price on subsequent calls.

The catch: Anthropic requires a minimum of 1024 tokens to be cacheable on Sonnet. Forge's system prompts are roughly 80 to 110 tokens each, well below threshold. The breakpoints are syntactically correct but get silently ignored by the API.

I considered three options. First, pad the prompts artificially to clear the minimum and get fake-looking cache hits. Lying to a metric. Rejected. Second, move the cache breakpoint to include tool definitions, hoping the combined region exceeds 1024 tokens. Padding for its own sake. Rejected. Third, leave the prompts at their honest 80-to-110 token size and disclose in the UI that caching is below threshold for this demo's prompt sizes. Picked this.

The cost panel on the live demo now carries an orange "heads up" callout explaining the threshold and what a production-sized prompt with 3K to 5K tokens of instructions plus tool schemas plus RAG context would actually see. Expected reduction in per-call input cost is roughly 70 to 90% of the cached prefix portion once it activates. The cache layer is correctly wired; it just does not light up at this scale.
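The threshold effect can be shown with a small cost model. The 1024-token minimum and the 10% read price are the figures quoted above; the function itself is a hypothetical sketch that covers cache reads only and ignores any surcharge on the call that first writes the cache.

```typescript
// Hypothetical cost model: ratio of input cost with caching to input cost without.
// Covers cache reads only; the first (cache-writing) call is not modeled.
function inputCostRatio(
  cachedPrefixTokens: number,
  totalInputTokens: number,
  minCacheableTokens = 1024,
  cacheReadMultiplier = 0.1,
): number {
  if (totalInputTokens <= 0 || cachedPrefixTokens > totalInputTokens) {
    throw new Error("prefix must fit inside a non-empty input");
  }
  // Below the minimum, the breakpoint is ignored and everything is full price.
  if (cachedPrefixTokens < minCacheableTokens) return 1;
  const uncachedTokens = totalInputTokens - cachedPrefixTokens;
  return (cachedPrefixTokens * cacheReadMultiplier + uncachedTokens) / totalInputTokens;
}
```

A 100-token prefix returns a ratio of 1, which is the zero-savings case the demo discloses. A 4,000-token prefix inside a 5,000-token input returns 0.28, a 72% cut in input cost on cache hits.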

The interview-shaped lesson is that the value of admitting a real limitation beats the value of faking a metric to make it look good. "I wired prompt caching, measured it, noticed our prompts were below Anthropic's minimum, and called it out instead of hiding the zero percent" is a stronger talking point than a fake hit rate.

What I would change for production

Forge is a proof-of-concept architecture demo. A real production version for a customer running thousands of debug investigations per day would change five things.

The session store moves to Upstash or KV. The in-memory Map<sessionId, SessionState> works on localhost but degrades gracefully to broken on Vercel because each serverless invocation potentially has its own fresh heap. Resume across replicas requires shared storage. The SessionStore interface in lib/store.ts is shaped for this swap; the implementation switch is one file.

A shared, cross-process limit around every Anthropic call. The current pLimit(4) bounds fan-out per request only. At one hundred concurrent investigations that is up to four hundred parallel Anthropic calls, which will burn the org's tokens-per-minute quota in seconds. The fix is a distributed rate limiter (Upstash sliding window) keyed on the API key. Per-request pLimit caps fan-out within one investigation; the shared limiter caps call volume across all concurrent investigations. Two different bounding levels, both needed.

Real GitHub and Sentry adapters with retry-with-backoff and circuit breakers. The fixture tools today return canned data. Production tools need their own rate-limit handling, exponential backoff on transient errors, and a circuit breaker that fails fast when the upstream is down so the agent does not waste tokens flailing.
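A sketch of the two pieces, assuming a simple consecutive-failure breaker with a fixed cooldown and capped exponential backoff without jitter. The class, thresholds, and function names are all hypothetical; a production adapter would likely add jitter and per-endpoint state.

```typescript
// Capped exponential backoff: base, 2x base, 4x base, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical consecutive-failure circuit breaker with an injectable clock.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      // Cooldown elapsed: let one trial call through. The failure count is
      // kept, so a single further failure re-opens the circuit.
      this.openedAt = null;
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

While the circuit is open the tool fails immediately without touching the upstream, so the agent can report "tool unavailable" instead of spending steps and tokens on retries that cannot succeed.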

Write-capable tools require idempotency keys, transactional outboxes, and saga compensation. Forge's read-only fixtures make abort safe. A production agent that creates pull requests, posts to Slack, or charges a customer mid-run would leave side effects half-done on abort. Idempotency keys make retries safe at the receiver. Outboxes make multi-system writes atomic by staging the external call in a DB row inside the same transaction as the local state change. Sagas pair each step with a compensating step that rolls back partial work. That is also where Vercel Workflows or a comparable durable execution layer earns its keep: each step is checkpointed, mid-step crash means the next worker resumes from the checkpoint, no work duplicated.
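Of the three, idempotency keys are the simplest to illustrate. The sketch below is an in-memory receiver that runs a side effect at most once per key; the class name and the key format are hypothetical, and a real receiver would persist keys in shared storage rather than a Map.

```typescript
// Hypothetical in-memory receiver: a given key runs its side effect at most once.
class IdempotentReceiver<T> {
  private completed = new Map<string, T>();

  execute(key: string, sideEffect: () => T): T {
    if (this.completed.has(key)) {
      // Retry of a request that already succeeded: return the stored result.
      return this.completed.get(key) as T;
    }
    const result = sideEffect();
    this.completed.set(key, result);
    return result;
  }
}

// A key derived from session and step stays stable across retries of that step.
function idempotencyKey(sessionId: string, step: string): string {
  return `${sessionId}:${step}`;
}
```

If an aborted or crashed run is retried, the step that creates a pull request presents the same key and gets the original result back instead of opening a second pull request.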

Per-tenant isolation and observability. Production agents are multi-tenant. Calibration logs need to be scoped per team so team A's overconfident lane does not drag team B's weights down. Every LLM call, tool call, and lane needs an OpenTelemetry span so an SRE can debug a stuck investigation without reading source. Real observability is not a feature, it is the precondition for running this in front of paying customers.

What I learned

Three lessons stick.

The signal is global; the actuator is local. Distributed systems pattern, but it cuts harder in AI infrastructure because the in-flight LLM call is the work that needs cancelling and the cancellation control is a per-process JavaScript object. Pin this phrase in your head; it will save you an hour of confused debugging the first time you ship an agent demo to Vercel.

The lanes are stateless; the system has memory. Calibration as a feedback loop in the merge layer is what separates a multi-agent system from a bag of LLM calls. The lanes never learn. The system learns by tracking which lanes to trust.

Every metric needs an honest defense. Cache hit rate of zero, illustrative eval snapshot, stub correctness oracle, in-memory session store on Vercel. Every shortcut Forge takes is documented in the UI or the README with what it would look like in production. The discipline of disclosing the gap, instead of hiding it, is the move that signals senior judgment.

If you want to see all of this in action, forge.kevinmurphywebdev.com is the live demo. The full source is at github.com/midimurphdesigns/forge. The most interesting files to read in order are docs/AGENTS.md, lib/coordinator.ts, and lib/store.ts. Everything else is implementation detail.