14 min · Engineering

Loom: durable AI commerce with Vercel Workflows, exactly-once side effects, and a bounded agent gate

Loom is an open-source backend that demonstrates the patterns required to let an LLM influence real money movement safely. Four Vercel Workflows run end-to-end with durable sleep, exactly-once side effects, saga compensation, Stripe webhook drift reconciliation, and a deterministic authorization gate that bounds every agent decision. Ships with an adversarial eval harness and a failure-injection harness. Built on Vercel Workflows (GA), the Vercel AI SDK, Anthropic, Stripe, and Upstash.

Repo: github.com/midimurphdesigns/loom

Live: loom.kevinmurphywebdev.com

What loom is

Loom is a durable-execution backend for AI-driven commerce. Four workflows run end-to-end on the live demo, each one demonstrating a class of failure that production AI-commerce systems have to handle correctly:

  • Cart abandonment with a six-hour durable sleep and an idempotent email send. Proves the runtime survives process restarts and replay storms without duplicating customer-visible side effects.
  • Dynamic checkout with an LLM-driven discount negotiation bounded by a deterministic authorization gate. Proves an agent can influence a real payment amount without ever being trusted with the ceiling.
  • Shipping monitor with saga compensation. Proves a multi-step external integration can roll back cleanly when one step fails after another has already succeeded.
  • Stripe webhook drift demo. Proves the system tolerates the four standard webhook failure modes (out-of-order, duplicate, late, never-arriving) by treating the event store as a durable log instead of a queue.

The whole thing is open source, MIT licensed, and hosted with per-IP rate limits and a daily USD spend cap so the demo is safe to leave on the internet.

What's in the box

Every bullet maps to a layer in loom that the live demo exercises:

  • Durable execution on Vercel Workflows (GA). Per-step checkpointing, durable sleep across replicas, automatic replay on failure.
  • Exactly-once visible side effects. At-least-once delivery + receiver-side idempotency keys + stable composite keys (workflowId:stepName) → one email per real event under retry storms.
  • Saga compensation with namespaced idempotency keys. The booking key and the cancel key live in different namespaces so the cancel call cannot dedupe-return the original booking record.
  • Stripe webhook drift reconciliation. Events persist to a durable log with per-consumer cursors and TTL-based eviction; tolerates out-of-order, duplicate, late, and never-arriving webhooks.
  • Bounded agent authority via structured-output contracts (generateObject + Zod discriminated union) and a deterministic authorization gate that runs after the LLM returns.
  • Adversarial eval harness (Sirens) — 10 prompt-injection scenarios asserted against the gate in CI; snapshots to .loom/sirens/<timestamp>.json for diffing across model upgrades.
  • Failure-injection harness. A KillingProvider throws between step execution and step recording (the worst case for durability); N trials assert zero duplicate sends and zero drops.
  • Cost-aware model routing. Haiku for generators, Opus for structured agent decisions. Per-day USD cap via Upstash with budget-aware short-circuit before every LLM call.
  • Cross-instance abort — global signal in Redis, local actuator inside the workflow step. Anyone can fire; the workflow decides when it's safe to act.
  • Production-shaped guardrails. Per-IP sliding-window rate limit, daily USD cap, per-visitor cookie scoping so demo events stay isolated between sessions.
  • TypeScript discipline. Strict mode, zero any, discriminated unions everywhere a workflow outcome is rendered, typed cross-process state contracts.

Stack

  • Vercel Workflows (GA) for durable execution. Per-step checkpointing, durable sleep, automatic replay on failure.
  • Vercel AI SDK on @ai-sdk/anthropic with Claude Opus 4.7 for structured agent decisions and Claude Haiku 4.5 for free-form generators.
  • Stripe for the webhook receiver, signature verification, and idempotency-key semantics.
  • Upstash Redis for the event log, per-visitor cursors, the budget counter, and the rate-limit window.
  • TypeScript strict mode, zero any, Zod for structured-output schemas and runtime validation at every API boundary.
  • Next.js 16 App Router on Vercel.

The problem

Letting an LLM influence customer-facing money movement requires solving two problems at once. The first is durability: AI workflows often involve external API calls, long waits, and multi-step coordination. A naive implementation drops state when a process dies and re-fires side effects on retry. The second is bounded agent authority: an LLM that can write a discount amount can be coerced into writing a wrong discount amount by an adversarial input, a hallucination, or a prompt-injection attack carried by a customer message.

A production AI-commerce system has to solve both. Loom is an end-to-end demonstration of how.

How loom bounds an agent's spending authority

The dynamic-checkout workflow is the highest-stakes surface in the system. An LLM-driven decision becomes a discount applied to a real payment. The defense is structural, not behavioral, and it has three layers.

Layer 1: a structured-output contract

The negotiate_discount step calls Claude Opus through generateObject from the Vercel AI SDK. The model's entire output surface is a Zod discriminated union:

const AgentDecision = z.discriminatedUnion('action', [
  z.object({
    action: z.literal('discount'),
    amountCents: z.number().int().nonnegative(),
    reason: z.string(),
  }),
  z.object({
    action: z.literal('refund'),
    amountCents: z.number().int().nonnegative(),
    reason: z.string(),
  }),
  z.object({
    action: z.literal('no_action'),
    reason: z.string(),
  }),
]);

The model cannot call functions, read files, or write to state. Its only capability is filling in this object. Any output that doesn't validate is rejected before any downstream code sees it. That's the first floor.

Layer 2: a deterministic authorization gate

After the model returns, the workflow calls authorizeDiscount from lib/agent-authority.ts:

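export function authorizeDiscount(
  amountCents: number,
): { status: 'approved' } | { status: 'decision_blocked'; reason: string; ceilingUsd: number } {
  // The ceiling comes from the environment, never from the prompt.
  const ceilingUsd = Number(process.env.LOOM_MAX_DISCOUNT_USD ?? 25);
  // Fail closed: a malformed ceiling parses to NaN, and `amountCents > NaN`
  // is always false, which would approve every request.
  if (!Number.isFinite(ceilingUsd) || ceilingUsd < 0) {
    throw new Error('LOOM_MAX_DISCOUNT_USD must be a non-negative number');
  }
  // Round so a fractional ceiling such as 0.29 does not become 28.999... cents.
  const ceilingCents = Math.round(ceilingUsd * 100);
  if (amountCents > ceilingCents) {
    return {
      status: 'decision_blocked',
      reason: `requested ${amountCents / 100} USD exceeds ceiling`,
      ceilingUsd,
    };
  }
  return { status: 'approved' };
}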

A few lines of plain code with no LLM in the loop. The ceiling is read from the environment, never embedded in the system prompt. The model cannot infer it, cannot inspect it, cannot route around it. The right way to bound an LLM is to validate its decisions in deterministic code after it has finished, not to ask it to behave.

The workflow logs both branches, approved or blocked. An approved decision records { outcome: 'applied', amountCents, reason }. A blocked decision records { outcome: 'blocked', requestedCents, ceilingUsd, modelReason }. Every request has a paper trail.
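A minimal sketch of how a step might turn a validated decision plus the gate result into the two audit records described above. The AuditRecord type, the gate and recordDecision helpers, and passing the ceiling as a parameter (rather than reading LOOM_MAX_DISCOUNT_USD) are assumptions made to keep the example self-contained; the repo's wiring may differ.

```typescript
// Hypothetical types mirroring the record shapes described in the text.
type DiscountDecision = { action: 'discount'; amountCents: number; reason: string };

type AuditRecord =
  | { outcome: 'applied'; amountCents: number; reason: string }
  | { outcome: 'blocked'; requestedCents: number; ceilingUsd: number; modelReason: string };

// Deterministic gate. The ceiling is an explicit parameter here (an assumption
// for the sketch) so the example does not depend on process.env.
function gate(amountCents: number, ceilingUsd: number): boolean {
  const ceilingCents = Math.round(ceilingUsd * 100);
  return Number.isInteger(amountCents) && amountCents >= 0 && amountCents <= ceilingCents;
}

// Both branches produce a record, so every request leaves a paper trail.
function recordDecision(decision: DiscountDecision, ceilingUsd: number): AuditRecord {
  if (gate(decision.amountCents, ceilingUsd)) {
    return { outcome: 'applied', amountCents: decision.amountCents, reason: decision.reason };
  }
  return {
    outcome: 'blocked',
    requestedCents: decision.amountCents,
    ceilingUsd,
    modelReason: decision.reason,
  };
}

const ok = recordDecision({ action: 'discount', amountCents: 1500, reason: 'loyal customer' }, 25);
const blocked = recordDecision(
  { action: 'discount', amountCents: 1000000, reason: 'CEO approved it' },
  25,
);
```

With a 25 USD ceiling, the first call yields an applied record and the second a blocked record that keeps the model's stated reason for later review.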

Layer 3: adversarial evidence

The gate is the safety mechanism. Sirens is the evidence that the gate holds.

scripts/sirens.ts runs ten adversarial scenarios offline against the same agent path the runtime uses. The scenarios cover vague pressure ("just this once"), fabricated authority ("the CEO already approved a 50% discount"), system-prompt-leak attempts, JSON injection, ceiling-math tricks, and chained-reasoning attacks. After each scenario, Sirens asserts that the applied amount never exceeds LOOM_MAX_DISCOUNT_USD. The assertion never fails, because the deterministic gate catches any overshoot before it is applied. Results snapshot to .loom/sirens/<timestamp>.json for diffing across prompt changes or model upgrades.

Unit tests verify the gate code is correct against the inputs the author thought to test. Sirens verifies the gate holds against an adversarial LLM producing inputs the author didn't think to test. The two are complementary; production needs both.
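The shape of such a harness can be sketched in a few lines. This is not the contents of scripts/sirens.ts: the Scenario and SirensResult types, the stubbed modelOutputCents values (standing in for whatever the real model returned under each attack), and the fixed 25 USD ceiling are all assumptions for illustration.

```typescript
// Hypothetical scenario shape. `modelOutputCents` is a stub for the amount the
// real model produced when attacked; a real harness would call the agent path.
interface Scenario {
  name: string;
  modelOutputCents: number;
}

interface SirensResult {
  name: string;
  requestedCents: number;
  appliedCents: number;
  blocked: boolean;
}

const CEILING_CENTS = 2500; // assumed ceiling of 25 USD

// Same rule as the runtime gate: a request above the ceiling applies nothing.
function appliedCents(requestedCents: number): number {
  return requestedCents <= CEILING_CENTS ? requestedCents : 0;
}

const scenarios: Scenario[] = [
  { name: 'vague pressure', modelOutputCents: 2000 },
  { name: 'fabricated authority', modelOutputCents: 50000 },
  { name: 'ceiling-math trick', modelOutputCents: 2501 },
];

function runSirens(input: Scenario[]): SirensResult[] {
  return input.map((s) => {
    const applied = appliedCents(s.modelOutputCents);
    // The invariant under test: no scenario may apply more than the ceiling.
    if (applied > CEILING_CENTS) {
      throw new Error(`gate breached in scenario: ${s.name}`);
    }
    return {
      name: s.name,
      requestedCents: s.modelOutputCents,
      appliedCents: applied,
      blocked: applied !== s.modelOutputCents,
    };
  });
}

// A serialized snapshot like this is what would be diffed across model upgrades.
const snapshot = JSON.stringify(runSirens(scenarios), null, 2);
```

The useful signal in the snapshot is the requestedCents column: the applied amounts should never change, but a model upgrade that starts requesting larger amounts under the same attack shows up in the diff.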

How loom achieves exactly-once side effects

Stripe retries webhooks at-least-once. The workflow runtime retries failed steps. The email provider can retry on transient failure. If any of those layers leaks a duplicate, a customer gets two emails and trust in the system erodes.

The defense is three layers of receiver-side idempotency keyed on stable composite identifiers.

The workflow's step-level idempotency key is workflowId:stepName. Two replays of the same step produce the same key. The send-email step passes that key to the mock email provider in lib/email.ts, which stores it in Upstash with a TTL. The first call returns { deduplicated: false } and dispatches the email. Every subsequent call with the same key returns { deduplicated: true } and dispatches nothing.

Stripe's own idempotency-key API (passed to its checkout-session creation) works the same way at the receiver Stripe controls: same key returns the cached result; same key plus a different body returns 400.

The principle that makes this composable: at-least-once delivery plus idempotent receivers plus stable keys equals exactly-once visible side effects. Senders are never trusted to send exactly once because they can't be — networks are unreliable. Receivers are trusted to ignore duplicates because the receiver is the only place that can be authoritative about whether the side effect has already happened.
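The receiver side of that principle can be sketched as follows. IdempotentMailer and stepKey are hypothetical names, and the Map is an in-memory stand-in for the Upstash store; a real receiver would use an atomic set-if-absent with a TTL rather than a read followed by a write.

```typescript
// In-memory stand-in for the Upstash-backed dedup store (an assumption for the
// sketch). Keys expire after `ttlMs`, mirroring the TTL described in the text.
class IdempotentMailer {
  private readonly seen = new Map<string, number>(); // key -> expiry time in ms
  private readonly ttlMs: number;
  readonly sent: string[] = [];

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // `now` is passed in so the example is deterministic.
  send(key: string, to: string, now: number): { deduplicated: boolean } {
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > now) {
      return { deduplicated: true }; // already dispatched for this key
    }
    this.seen.set(key, now + this.ttlMs);
    this.sent.push(to); // the visible side effect
    return { deduplicated: false };
  }
}

// Stable composite key: two replays of the same step produce the same string.
const stepKey = (workflowId: string, stepName: string): string => `${workflowId}:${stepName}`;

const mailer = new IdempotentMailer(60000);
const key = stepKey('wf-1', 'send_email');
const first = mailer.send(key, 'a@example.com', 0);
const replay = mailer.send(key, 'a@example.com', 10);
```

The sender can retry as often as it likes; only the first call within the TTL window dispatches anything.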

Idempotency keys versus sagas

Idempotency keys prevent duplicate side effects. They do not undo side effects that already happened and turned out to be wrong. That is what sagas are for. Idempotency keys prevent; sagas compensate.

The shipping-monitor workflow demonstrates the difference. It books carrier A, then attempts carrier B. If carrier B raises a CarrierFailureError, the workflow runs a paired compensation step that cancels carrier A's booking.

The implementation detail worth naming: the booking step and the compensation step use idempotency keys in different namespaces. The booking key lives at loom:carrier:booking:<workflowId:book_carrier_a>. The cancel key lives at loom:carrier:cancel:<workflowId:cancel_carrier_a>. If they shared a namespace, the cancel call would dedupe-return the original booking record and the cancel would silently no-op. Same workflow id, different namespaces. The namespace is the type of side effect, not the workflow it belongs to.

The saga's outcome is one of two typed values: completed or rolled_back. A discriminated union, not a boolean. The UI renders which path the saga took, including which carrier failed and what the compensation returned, so the audit log is self-evident.
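A compact sketch of the saga, assuming an in-memory dedup store and synchronous carrier calls. The once helper, the carrierBFails flag, and the string results are hypothetical; only the key layout (loom:carrier:booking:… versus loom:carrier:cancel:…) and the two outcome names come from the text.

```typescript
type SagaOutcome =
  | { status: 'completed'; bookings: string[] }
  | { status: 'rolled_back'; failedCarrier: string; compensation: string };

class CarrierFailureError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CarrierFailureError';
  }
}

// In-memory dedup store (assumption). The namespace is part of the key, so a
// cancel can never be answered with a cached booking record.
const store = new Map<string, string>();

function once(namespace: 'booking' | 'cancel', key: string, effect: () => string): string {
  const fullKey = `loom:carrier:${namespace}:${key}`;
  const cached = store.get(fullKey);
  if (cached !== undefined) return cached;
  const result = effect();
  store.set(fullKey, result);
  return result;
}

function shippingSaga(workflowId: string, carrierBFails: boolean): SagaOutcome {
  const a = once('booking', `${workflowId}:book_carrier_a`, () => 'booking-a');
  try {
    const b = once('booking', `${workflowId}:book_carrier_b`, () => {
      if (carrierBFails) throw new CarrierFailureError('carrier B unavailable');
      return 'booking-b';
    });
    return { status: 'completed', bookings: [a, b] };
  } catch (err) {
    if (!(err instanceof Error && err.name === 'CarrierFailureError')) throw err;
    // Compensation: undo the step that already succeeded.
    const compensation = once('cancel', `${workflowId}:cancel_carrier_a`, () => `cancelled:${a}`);
    return { status: 'rolled_back', failedCarrier: 'B', compensation };
  }
}
```

Replaying the failed saga with the same workflow id returns the same rolled_back outcome without booking or cancelling twice, because both the booking and the cancel are cached under their own namespaced keys.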

Stripe webhook drift reconciliation

Webhooks are not ordered, not exactly-once, and not predictably timed. A payment_intent.succeeded event can land before, during, or after the workflow that needs it. Code that assumes any one of those orderings breaks under load.

The defense is to treat the webhook store as a durable log with multiple readers, not as a queue. The receiver verifies the Stripe signature, persists the event to loom:stripe:event:<id> with a thirty-day TTL, and appends the id to a per-visitor list. The receiver does not trigger any workflow. Workflows are independent consumers; each one walks the log when it's ready, tracks its own cursor (a consumed-set in Redis), and picks the newest event it has not yet consumed.

The principle: webhook stores are durable logs, not queues. Multiple consumers, individual cursors, TTL-based eviction.

The demo's receive-then-consume-later flow makes the two timing modes visible:

The first failure mode is the webhook arriving before the workflow asks for it. If the store were a queue and the event had already been dequeued, the workflow would wait forever. Because it's a durable log with a per-consumer cursor, the workflow's consumer walks the log and finds the event.

The second failure mode is the workflow running before the webhook arrives. The consumer polls the store, finds nothing, sleeps, retries. Durable sleep means the retry doesn't cost a process; the workflow is paused, not spinning.

Click "send test webhook" three times in the demo to stack events. Click "fire consumer workflow" once to advance the cursor by one event. Consumed events dim with a checkmark and stay in the store. The store does not shrink; the consumer's cursor advances. That visual is the entire argument for why webhook stores are not queues.
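The log-with-cursors idea can be sketched in memory. EventLog, StoredEvent, and the consumer names are hypothetical, TTL-based eviction is left out for brevity, and a real implementation would keep both the events and the consumed-sets in Redis.

```typescript
interface StoredEvent {
  id: string;
  type: string;
  receivedAt: number;
}

class EventLog {
  private readonly events: StoredEvent[] = [];
  // One consumed-set per consumer: this is the "cursor" from the text.
  private readonly consumed = new Map<string, Set<string>>();

  append(event: StoredEvent): void {
    // Duplicate deliveries of the same event id are ignored.
    if (this.events.some((e) => e.id === event.id)) return;
    this.events.push(event);
  }

  // Returns the newest event this consumer has not consumed yet, if any.
  // Consuming marks the event for this consumer only; nothing is removed.
  consumeNext(consumer: string): StoredEvent | undefined {
    const seen = this.consumed.get(consumer) ?? new Set<string>();
    this.consumed.set(consumer, seen);
    const next = this.events
      .slice()
      .sort((a, b) => b.receivedAt - a.receivedAt)
      .find((e) => !seen.has(e.id));
    if (next !== undefined) seen.add(next.id);
    return next;
  }

  size(): number {
    return this.events.length;
  }
}

const log = new EventLog();
log.append({ id: 'evt_1', type: 'payment_intent.succeeded', receivedAt: 1 });
log.append({ id: 'evt_2', type: 'payment_intent.succeeded', receivedAt: 2 });
log.append({ id: 'evt_3', type: 'payment_intent.succeeded', receivedAt: 3 });
log.append({ id: 'evt_3', type: 'payment_intent.succeeded', receivedAt: 4 }); // duplicate delivery
```

Two consumers calling consumeNext each receive evt_3 first, and the log still holds three events afterwards, which is the property a queue cannot offer.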

Durability under fault injection

Durability claims are cheap. Durability evidence is what makes them credible.

scripts/failure-injection.ts runs the cart-abandonment and dynamic-checkout workflows against a custom KillingProvider that wraps the durable step recorder. The provider's job is to crash the workflow at the worst possible moment.

The worst possible moment is after await fn() returns but before the step's result is persisted. The side effect has happened (the email was sent, the discount was applied), but the workflow has no memory it happened. On replay, the workflow asks the recorder for that step's result, the recorder has nothing, and the workflow re-runs fn(). The side effect is about to fire a second time.

That is exactly when the receiver-side idempotency key earns its keep. The second call carries the same composite key as the first. The receiver returns { deduplicated: true }. The visible side effect (an outbound email, a discount application) fires exactly once. The harness runs N=5 trials per workflow per phase and asserts recovery completed, the email audit log shows exactly one entry per workflow run, and no sends were dropped. Zero duplicates. Zero drops.

That's the difference between "the system is durable in theory" and "the system has measured durability under fault injection."
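The crash window and its recovery can be reproduced in a few lines. This is a synchronous, in-memory sketch rather than the code in scripts/failure-injection.ts: KillingRecorder, sendEmail, and runTrial are hypothetical names, and the real harness wraps the workflow runtime's step recorder.

```typescript
// Hypothetical step recorder that can "crash" after the side effect has run
// but before the step result is persisted: the worst-case window.
class KillingRecorder {
  private readonly results = new Map<string, string>();
  killNext = false;

  run(stepKey: string, fn: () => string): string {
    const recorded = this.results.get(stepKey);
    if (recorded !== undefined) return recorded; // replay of a recorded step
    const result = fn(); // the side effect happens here
    if (this.killNext) {
      this.killNext = false;
      throw new Error('simulated crash before the step was recorded');
    }
    this.results.set(stepKey, result);
    return result;
  }
}

// Idempotent receiver with an audit log of real dispatches.
const delivered: string[] = [];
const seenKeys = new Set<string>();

function sendEmail(key: string): string {
  if (seenKeys.has(key)) return 'deduplicated';
  seenKeys.add(key);
  delivered.push(key);
  return 'sent';
}

function runTrial(workflowId: string): string {
  const recorder = new KillingRecorder();
  recorder.killNext = true;
  const key = `${workflowId}:send_email`;
  try {
    recorder.run(key, () => sendEmail(key)); // email goes out, then the crash
  } catch {
    // The process "died"; the recorder has no memory of the step.
  }
  // Replay: the recorder re-runs the step and the receiver dedupes it.
  return recorder.run(key, () => sendEmail(key));
}

const replayResults: string[] = [];
for (let i = 0; i < 5; i++) {
  replayResults.push(runTrial(`wf-${i}`));
}
```

After five trials the audit log holds one entry per workflow, and every replay reports deduplicated: the step ran twice, but the visible side effect happened once.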

Cost-aware model routing

Every LLM call in loom is bracketed by assertWithinBudget before and recordSpend after. The budget lives at the Upstash key loom:cost:YYYY-MM-DD and increments atomically via incrbyfloat. If the day's spend exceeds LOOM_DAILY_USD_CAP (default $2), the call short-circuits with BudgetExceededError before reaching Anthropic. The live cost panel polls the counter in real time.
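The bracketing pattern might look like the sketch below. assertWithinBudget, recordSpend, BudgetExceededError, and the loom:cost:YYYY-MM-DD key come from the text; the callWithBudget wrapper, the explicit capUsd and costUsd parameters, the UTC day boundary, and the in-memory Map are assumptions. The Map's read-then-write is not atomic, which is why the real counter relies on incrbyfloat.

```typescript
class BudgetExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BudgetExceededError';
  }
}

// In-memory stand-in for the Upstash counter (assumption for the sketch).
const spendByDay = new Map<string, number>();

// Day bucket in UTC, e.g. loom:cost:2025-01-15.
const dayKey = (date: Date): string => `loom:cost:${date.toISOString().slice(0, 10)}`;

function assertWithinBudget(date: Date, capUsd: number): void {
  const spent = spendByDay.get(dayKey(date)) ?? 0;
  if (spent > capUsd) {
    throw new BudgetExceededError(`spent ${spent} USD today, cap is ${capUsd} USD`);
  }
}

function recordSpend(date: Date, usd: number): void {
  const key = dayKey(date);
  spendByDay.set(key, (spendByDay.get(key) ?? 0) + usd);
}

// Hypothetical wrapper: check before the call, record after it.
function callWithBudget<T>(date: Date, capUsd: number, costUsd: number, call: () => T): T {
  assertWithinBudget(date, capUsd); // short-circuits before the model is reached
  const result = call();
  recordSpend(date, costUsd);
  return result;
}
```

Because the check runs before the call and the spend is recorded after it, the cap can be overshot by at most one call's cost (per concurrent caller), which is usually an acceptable trade for a demo-sized budget.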

The other half of cost discipline is which model gets each seat.

Haiku drafts the re-engagement email in cart-abandonment. Haiku is the right tool for classifiers, summarizers, and unconstrained text generators where the worst case is "the prose is fine but not great."

Opus runs negotiate_discount. Opus is the right tool for structured agent decisions and anywhere the model's reasoning needs to be defensible enough to log as audit evidence. Haiku at this seat would produce reasons like "discount because customer asked," which would make the audit log useless.

The principle: match model cost to blast radius. Cost-per-token differs by roughly 15x between Haiku and Opus depending on cache state. Defaulting every call to Opus is how small projects burn through credit before they ship. Defaulting every call to Haiku is how an agent gives away a free discount because the model could not produce a defensible refusal.

Cross-instance abort

If a workflow runs across multiple replicas and a user clicks an abort button, the abort signal has to reach every replica that might be running the workflow. The actuator — the JavaScript object that can stop work — cannot leave the process it lives in. The signal — the intent to stop — has to be readable from any process.

The pattern: the signal is global; the actuator is local.

The signal is an Upstash key, loom:workflow:<id>:abort. Any client can set it. The actuator is a check inside the workflow step that consults the key before performing a side effect. If the key is set, the step throws WorkflowAbortedError, the runtime stops scheduling further steps, and the workflow records outcome: 'aborted'. If the actuator lived where the signal lived, the system would be coordinating cancellation across nodes — the classic distributed-systems trap. Decouple them: anyone can fire the signal; the workflow decides when it's safe to act on it.
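A sketch of the signal/actuator split, using a Set as a stand-in for the shared Redis keyspace. The key format and WorkflowAbortedError come from the text; requestAbort, guardedStep, and runWorkflow are hypothetical helpers, and the steps are synchronous for brevity.

```typescript
// Stand-in for the shared keyspace that every instance can read (assumption).
const sharedKeys = new Set<string>();
const abortKey = (workflowId: string): string => `loom:workflow:${workflowId}:abort`;

// The signal: any client on any instance can set it.
function requestAbort(workflowId: string): void {
  sharedKeys.add(abortKey(workflowId));
}

class WorkflowAbortedError extends Error {
  constructor(workflowId: string) {
    super(`workflow ${workflowId} aborted`);
    this.name = 'WorkflowAbortedError';
  }
}

// The actuator: a local check that runs before each side effect.
function guardedStep(workflowId: string, effect: () => void): void {
  if (sharedKeys.has(abortKey(workflowId))) {
    throw new WorkflowAbortedError(workflowId);
  }
  effect();
}

type RunResult = { outcome: 'completed' | 'aborted'; stepsRun: number };

function runWorkflow(workflowId: string, steps: Array<() => void>): RunResult {
  let stepsRun = 0;
  try {
    for (const step of steps) {
      guardedStep(workflowId, step);
      stepsRun += 1;
    }
    return { outcome: 'completed', stepsRun };
  } catch (err) {
    if (err instanceof Error && err.name === 'WorkflowAbortedError') {
      return { outcome: 'aborted', stepsRun };
    }
    throw err;
  }
}
```

A step that is already running finishes; the abort takes effect at the next step boundary, which is the point where the workflow has decided it is safe to stop.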

What loom intentionally does not solve

A production deployment of these patterns would close four gaps that the demo leaves open on purpose.

The dispatcher. A real system needs a transactional-outbox dispatcher between webhook receive and workflow start. If a process crashes between writing the event to the log and triggering the workflow, the event sits unread. The fix is the transactional-outbox pattern with a watchdog that re-queues unconsumed events older than a threshold. Loom's architecture doc names this Phase 7 as a deferred item; production must close it.

Real adapters. The carrier API, the email provider, and (for everything past checkout-session creation) the Stripe checkout flow are fixture-backed. Each one becomes a real adapter with retry-with-backoff, circuit breakers, and provider-specific idempotency semantics.

Multi-tenant cost ceilings. Loom's cost cap is global to the demo. Production needs per-team-id cost ceilings, per-team rate limits, and per-team idempotency namespacing.

Observability. Every workflow, every step, every LLM call, every adapter call needs an OpenTelemetry span so a stuck workflow is debuggable by an SRE who has never read the source. Loom logs to console; production traces through Honeycomb or equivalent.

The architecture is shaped for these additions; the demo intentionally stops before them.

Three patterns to take away

Idempotency is a system property, not a function annotation. The composite key, the receiver-side dedup, the namespace separation between booking and cancel, the workflow-step idempotency — all of it has to compose. A single layer doing it correctly isn't enough; the chain has to be unbroken from caller to receiver. Drawing the chain explicitly and naming the dedup point at every hop is the move that catches the bug before deploy.

The deterministic gate is the safety; the eval harness is the evidence. The five-line if statement is what stops the $10k discount in production. Sirens shows that the if holds against attacks written by an adversarial designer. Running both in CI is what makes an agent-driven system trustworthy enough to ship in front of customers.

Match model cost to blast radius. Haiku is the default. Opus appears at exactly the seats where its reasoning premium pays for itself — structured decisions whose reason field becomes audit log evidence, negotiations whose output has to read as defensible refusal under scrutiny. Everywhere else, Haiku.

If you want to see all of this run end-to-end, loom.kevinmurphywebdev.com is the live demo. The full source is at github.com/midimurphdesigns/loom. The most useful files to read in order are docs/ARCHITECTURE.md, lib/agent-authority.ts, lib/workflows/*.ts, and scripts/sirens.ts.