
Open-source and personal projects · 2026

forge — multi-agent debugging concierge

Point it at a stack trace and four specialist subagents fan out in parallel, each with a focused tool set, before a coordinator merges their structured findings into ranked, calibration-weighted hypotheses. Parallel fan-out via Promise.all plus pLimit bounded concurrency. Two-pass agent pattern (generateText then generateObject) for reliable structured output from a tool-using loop. Durable session state with resumable streams. Preemptive abort plumbed end-to-end via AbortSignal, with cross-instance signaling through Upstash so the stop button works across Vercel serverless replicas. Brier-score calibration log as a system-level feedback loop. Five-scenario eval harness with graded rubric and n-runs aggregation. Anthropic prompt-caching breakpoints with honest below-threshold disclosure. Hardened with per-IP rate limit and daily USD spend cap.

Author · Applied AI · TypeScript · Next.js 16 · Vercel AI SDK · Anthropic SDK · p-limit · Zod · Upstash · Tailwind v4

Repo: github.com/midimurphdesigns/forge

Live demo: forge.kevinmurphywebdev.com

Read the full story: Building forge

Forge is a multi-agent debugging concierge. Paste a stack trace; four specialist subagents fan out in parallel, each with its own focused tool set; a coordinator merges their structured findings into ranked, calibration-weighted hypotheses; and the whole investigation streams to the browser as it unfolds. Headline numbers: 4 parallel subagents, ~$0.08 per full investigation on Claude Sonnet 4.6, and ~1s click-to-aborted latency on Vercel.

How it's built

Next.js 16 App Router on Vercel, TypeScript strict, zero any. The Vercel AI SDK (generateText plus generateObject) on @ai-sdk/anthropic with Claude Sonnet 4.6. Parallel lane dispatch via Promise.all over pLimit(4), with each lane catching its own errors and returning a typed LaneOutcome discriminated union so the outer promise never rejects. Two-pass agent pattern per lane: generateText with tools and stepCountIs(N) for the investigation loop, then generateObject with a Zod schema to coerce the transcript into a typed result. The split mirrors the AI SDK's two separate entry points, and it matters in practice: a single call asked to both use tools and emit structured output tends to hallucinate one or the other.
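The fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/coordinator.ts: the fields of LaneOutcome, the lane names, and the helpers createLimit, runLane, and fanOut are all assumptions, and createLimit is a small dependency-free stand-in for p-limit rather than the library itself.

```typescript
// Hypothetical shape: the source names LaneOutcome but does not show its fields.
type LaneOutcome<T> =
  | { status: "ok"; lane: string; result: T }
  | { status: "error"; lane: string; message: string };

// Stand-in for p-limit: at most `max` tasks run at once, the rest wait in FIFO order.
function createLimit(max: number) {
  let active = 0;
  const waiting: Array<() => void> = [];
  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // Wait for a slot; the finishing task hands its slot over directly.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // slot transferred, so `active` stays the same
      else active--;
    }
  };
}

// Each lane catches its own errors and always resolves to a LaneOutcome.
async function runLane<T>(lane: string, work: () => Promise<T>): Promise<LaneOutcome<T>> {
  try {
    return { status: "ok", lane, result: await work() };
  } catch (err) {
    return { status: "error", lane, message: err instanceof Error ? err.message : String(err) };
  }
}

// Because no lane promise can reject, Promise.all cannot reject either.
async function fanOut<T>(
  lanes: Array<{ lane: string; work: () => Promise<T> }>,
  concurrency = 4,
): Promise<LaneOutcome<T>[]> {
  const limit = createLimit(concurrency);
  return Promise.all(lanes.map(({ lane, work }) => limit(() => runLane(lane, work))));
}
```

Wrapping each lane before handing it to Promise.all is what keeps one failing subagent from discarding the results of the other three; the caller narrows on status to tell the two cases apart.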

Session state is durable. Each investigation gets a UUID, the browser pins it to the URL via history.replaceState, and refresh fires a resume GET that replays the buffered lane state and the merged hypotheses. The same UUID can be shared across browsers and they all subscribe to the same snapshot. The session store is interface-shaped so swapping the dev in-memory implementation for Upstash KV is a one-file change.
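One way to read "interface-shaped" is sketched below, under stated assumptions: the SessionSnapshot fields, the method names get and put, and the resume helper are illustrative guesses, not the contents of lib/store.ts. The point is only that the resume path depends on the interface, so a KV-backed class could replace the in-memory one without touching callers.

```typescript
// Hypothetical snapshot shape; the source does not list the stored fields.
interface SessionSnapshot {
  lanes: Record<string, { status: string; events: string[] }>;
  hypotheses: string[];
}

// The only surface the rest of the app would depend on.
interface SessionStore {
  get(id: string): Promise<SessionSnapshot | undefined>;
  put(id: string, snapshot: SessionSnapshot): Promise<void>;
}

// Dev implementation: state lives in this process only, so another
// serverless replica would not see it.
class InMemorySessionStore implements SessionStore {
  private readonly sessions = new Map<string, SessionSnapshot>();

  async get(id: string): Promise<SessionSnapshot | undefined> {
    return this.sessions.get(id);
  }

  async put(id: string, snapshot: SessionSnapshot): Promise<void> {
    this.sessions.set(id, snapshot);
  }
}

// Resume logic: replay the buffered snapshot, or report a miss.
async function resume(
  store: SessionStore,
  id: string,
): Promise<{ status: 200; body: SessionSnapshot } | { status: 404 }> {
  const snapshot = await store.get(id);
  return snapshot ? { status: 200, body: snapshot } : { status: 404 };
}
```

Keeping the methods asynchronous even for the in-memory version means a network-backed store can implement the same interface unchanged.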

Per-lane interrupt is preemptive, not cooperative. The click handler optimistically sets the lane status to stopping and writes the abort intent to an Upstash Set (the cross-instance signal); the coordinator's 500ms poll loop then reads the Set and calls controller.abort() on the lane's local AbortController. The signal propagates through generateText's abortSignal parameter all the way down to the fetch to api.anthropic.com, which closes its socket. Click-to-aborted round-trip is roughly one second on Vercel, instant on localhost.
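The split between a global signal and a local actuator can be sketched like this. It is an assumption-laden illustration: AbortFlags and InMemoryAbortFlags stand in for the Upstash Set, and pollOnce and watchAborts are hypothetical names, not functions from the repo.

```typescript
// Stand-in for the shared Upstash Set: anything that can answer
// "has an abort been requested for this lane?" fits this interface.
interface AbortFlags {
  request(laneId: string): Promise<void>;
  isRequested(laneId: string): Promise<boolean>;
}

class InMemoryAbortFlags implements AbortFlags {
  private readonly flags = new Set<string>();

  async request(laneId: string): Promise<void> {
    this.flags.add(laneId);
  }

  async isRequested(laneId: string): Promise<boolean> {
    return this.flags.has(laneId);
  }
}

// One poll: abort every local controller whose lane is flagged.
// Returns the lane ids aborted during this pass.
async function pollOnce(
  flags: AbortFlags,
  controllers: Map<string, AbortController>,
): Promise<string[]> {
  const aborted: string[] = [];
  for (const [laneId, controller] of controllers) {
    if (controller.signal.aborted) continue;
    if (await flags.isRequested(laneId)) {
      controller.abort();
      aborted.push(laneId);
    }
  }
  return aborted;
}

// Poll on an interval (500ms in the source). Returns a stop function.
function watchAborts(
  flags: AbortFlags,
  controllers: Map<string, AbortController>,
  intervalMs = 500,
): () => void {
  const timer = setInterval(() => {
    // A failed poll is simply retried on the next tick.
    pollOnce(flags, controllers).catch(() => {});
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The replica that receives the click only writes the flag; whichever replica owns the lane's AbortController notices it on its next poll, which is why the worst-case delay is bounded by the poll interval plus network time.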

Calibration as system-level learning

Every (predicted confidence, rubric outcome) pair gets logged to Upstash after each session. Brier scores are computed per lane (mean squared error between predicted probability and binary outcome). Weights derive from mean outcome divided by mean predicted, clamped to the range 0.5 to 1.5 with a three-sample floor before weighting activates. The coordinator multiplies each lane's confidence by its weight before ranking the merged hypotheses. The lanes themselves are stateless function calls; the system's memory lives in the calibration log. A chronically overconfident lane gets downweighted; an underconfident lane gets upweighted. The system gets better at surfacing correct answers even though no individual lane improves.
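The arithmetic in this section is small enough to sketch directly. The type and function names below are hypothetical, and two details are assumptions the source does not state: a lane below the three-sample floor gets a neutral weight of 1, and a lane whose mean predicted confidence is zero also falls back to 1.

```typescript
interface CalibrationSample {
  predicted: number; // the lane's stated confidence, in [0, 1]
  outcome: 0 | 1;    // binary rubric outcome
}

// Callers are expected to pass a non-empty array.
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

// Brier score: mean squared error between prediction and outcome (lower is better).
function brierScore(samples: CalibrationSample[]): number {
  return mean(samples.map((s) => (s.predicted - s.outcome) * (s.predicted - s.outcome)));
}

// Weight = mean outcome / mean predicted, clamped to [0.5, 1.5].
function laneWeight(samples: CalibrationSample[], minSamples = 3): number {
  if (samples.length < minSamples) return 1; // assumption: neutral until the floor is met
  const meanPredicted = mean(samples.map((s) => s.predicted));
  if (meanPredicted === 0) return 1; // assumption: avoid dividing by zero
  const ratio = mean(samples.map((s) => s.outcome)) / meanPredicted;
  return Math.min(1.5, Math.max(0.5, ratio));
}

interface Hypothesis {
  lane: string;
  claim: string;
  confidence: number;
}

// Multiply each confidence by its lane weight, then sort best-first.
function rankHypotheses(hypotheses: Hypothesis[], weights: Record<string, number>) {
  return hypotheses
    .map((h) => ({ ...h, weighted: h.confidence * (weights[h.lane] ?? 1) }))
    .sort((a, b) => b.weighted - a.weighted);
}
```

As a worked case, a lane that claims 0.9 four times and is right twice has a mean outcome of 0.5 against a mean prediction of 0.9, so its weight is about 0.56 and its next 0.9 claim ranks as roughly 0.5. A lane that claims 0.4 three times and is right every time has a raw ratio of 2.5, which the clamp caps at 1.5.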

Eval discipline

Forge ships with a CLI eval runner (scripts/eval.ts) that runs five reproducible bug scenarios N times each against a graded rubric. The rubric scores correctness components (file match, line-range intersection-over-union, top suspect match, severity exact) plus process components (snippet present, reasoning length, candidate explanations). N-runs aggregation reports mean and standard deviation per scenario, so a prompt change has to stand out from run-to-run variance to count as an improvement rather than single-run noise. The Brier outcome that feeds calibration is derived from total rubric score with a 60% threshold so partial-correct answers (right file, slightly wrong line range) count as the useful signal they are.
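Three of the pieces named above are simple enough to sketch: line-range IoU, per-scenario aggregation, and the thresholded Brier outcome. The names are hypothetical and several details are assumptions rather than facts about lib/eval/rubric.ts: ranges are treated as inclusive line numbers, the standard deviation uses the sample (n - 1) form, and a score exactly at 60% counts as a pass.

```typescript
interface LineRange {
  start: number; // inclusive
  end: number;   // inclusive
}

// Intersection-over-union of two inclusive line ranges, in [0, 1].
function lineRangeIoU(a: LineRange, b: LineRange): number {
  const overlap = Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start) + 1);
  const union = (a.end - a.start + 1) + (b.end - b.start + 1) - overlap;
  return union <= 0 ? 0 : overlap / union;
}

// Mean and sample standard deviation across the N runs of one scenario.
function summarize(scores: number[]): { mean: number; stdDev: number } {
  const n = scores.length;
  if (n === 0) return { mean: 0, stdDev: 0 };
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  if (n === 1) return { mean, stdDev: 0 };
  const sumSquares = scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0);
  return { mean, stdDev: Math.sqrt(sumSquares / (n - 1)) };
}

// Collapse a graded rubric total into the binary outcome used for calibration.
function brierOutcome(totalScore: number, maxScore: number, threshold = 0.6): 0 | 1 {
  return totalScore / maxScore >= threshold ? 1 : 0;
}
```

For example, a predicted range of lines 10 to 20 against an expected range of 15 to 25 shares 6 lines out of 16 covered in total, so lineRangeIoU returns 0.375: partial credit rather than a miss.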

Artifacts worth reading

  • lib/coordinator.ts. The fan-out, the per-lane AbortController, the cross-instance Upstash poll, the calibration-aware merge. The center of the system.
  • docs/AGENTS.md. The design contract for the four subagents, written before the code so it could be re-explained from words later.
  • lib/eval/rubric.ts. The graded scoring system. Where IoU and the 60% Brier threshold live.
  • lib/store.ts. The session store interface, the in-memory implementation, the Upstash-backed abort flag, and the registerLaneController hook. Where the global-signal vs local-actuator split is most visible.

The trade-offs

The session store is in-memory in the live demo, which means resume across replicas degrades to occasional 404s on Vercel. The calibration log and the abort signal already moved to Upstash; the session store is a one-file follow-up. Prompt-caching breakpoints are wired on every system message but Anthropic's 1024-token minimum means they sit unused at this prompt size, which the live cost panel discloses honestly rather than padding the prompt to fake hits. The four subagent tools are fixture-backed; the real GitHub and Sentry adapters are a separate phase that doesn't change the architecture. Production use with write-capable tools (create_pr, send_message, charge_customer) would also need idempotency keys plus a transactional outbox plus saga compensation plus a durable execution layer like Vercel Workflows to make aborts safe in the presence of side effects. The architecture is shaped for those additions; the demo intentionally stops before them.