
Building fieldops-mcp: designing the tool surface an agent actually uses

I built a small MCP server that turns a fictional dispatcher's workflow into agent tools. The interesting part wasn't the code — it was the design choices that decide what the agent can do at all. Here's what I built and what shipping it taught me.

Repo: github.com/midimurphdesigns/fieldops-mcp

Live showcase: fieldops-mcp.kevinmurphywebdev.com

Part of a trilogy demonstrating the FDE / Applied AI shape — see also fedbench (eval rigor) and grant-pilot (sub-agent orchestration).

I spent a weekend on a second small AI artifact: fieldops-mcp, an MCP server that exposes a small-business field-services workflow as agent tools. It's a sibling to fedbench — same engineering register, different question. fedbench is about measuring whether an agent is right. fieldops-mcp is about shaping what the agent can do at all.

What the server does

There's a fictional twelve-person field-services company in src/fixtures/data.json — eight techs with overlapping skills, a dozen open jobs, eight customers. None of it real, all of it small enough to hold in your head in thirty seconds.

The server exposes that world as six MCP tools, and the choice of which six is most of the actual design work:

  • list_open_jobs — read the queue, optionally filtered, sorted urgent-first.
  • find_available_techs — search techs by skill across a time window, ranked by free minutes.
  • assign_job — book a tech; reject on skill mismatch or schedule conflict with a typed error.
  • draft_customer_message — compose a confirmation, delay, or reschedule message grounded in real customer and tech data.
  • compute_utilization — aggregate per-tech load over a forward window plus a capacity-gap summary.
  • flag_for_human — escalate ambiguous cases to the dispatcher's review queue rather than guess.

A read, a search, a mutation, a composition, an aggregation, an escalation. Six tools, six structurally distinct shapes.
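As a concrete sketch of the read shape, here is roughly what the filter-and-sort inside list_open_jobs could look like. The field names are my assumptions for illustration, not the repo's actual fixture schema:

```typescript
// Illustrative read-shape tool: filter open jobs, sort urgent-first.
// Field names (status, urgent, skill) are assumed, not the repo's schema.
interface Job {
  id: string;
  status: "open" | "done";
  urgent: boolean;
  skill: string;
}

function listOpenJobs(jobs: Job[], skill?: string): Job[] {
  return jobs
    .filter(j => j.status === "open" && (!skill || j.skill === skill))
    // Stable sort: urgent jobs float to the top, original order otherwise.
    .sort((a, b) => Number(b.urgent) - Number(a.urgent));
}
```

The point of the read shape is that it is boring on purpose: deterministic filtering and ordering the server owns, so the model never has to re-derive "urgent-first" from raw records.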

Why the shape diversity matters

The mistake I almost made on my first pass was exposing a half-dozen flavors of search — list-by-status, list-by-skill, list-by-customer, list-by-region, and so on. The agent doesn't gain new capability from any of those. It gains new ways to do the thing it could already do. That's surface area without leverage.

The shapes I actually picked each unlock something the agent couldn't reasonably do without the tool. Search-and-rank is the join the model would otherwise have to do by hand on a list result, badly. Mutation with typed errors is the affordance the model needs to know its write failed — and why — instead of optimistically assuming. Composition reads three records and writes a customer-facing message that quotes the real customer name, the real tech, and the real job description. The agent picks the intent; the tool picks the substance.

A live Claude transcript captured during the smoke test shows the difference:

> find_available_techs({"skill":"hvac","window_start":"2026-05-09T13:00:00Z","window_end":"2026-05-09T16:00:00Z"})
{ "matches": [
    { "tech": "T-01 (Tech A.M.)", "free_minutes_in_window": 180 },
    { "tech": "T-02 (Tech B.K.)", "free_minutes_in_window": 180 },
    { "tech": "T-08 (Tech H.L.)", "free_minutes_in_window": 180 },
    { "tech": "T-05 (Tech E.N.)", "free_minutes_in_window": 120 }
] }

A: I have 4 available HVAC techs. J-2001 requires 120 minutes, and Tech E.N.
   has exactly 120 free minutes while the others have 180. I'll assign Tech A.M.
   to give us more buffer.

The model isn't matching strings. It's reasoning over the shape of the tool's output — picking the tech with margin instead of the tech that just barely fits. The agent does the judgment; the server does the math.
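The "server does the math" half of that split is plain interval arithmetic. A minimal sketch of how a free_minutes_in_window figure could be computed (assumed logic, not the repo's actual code): clip each booking to the window, walk the bookings in order, and sum the gaps.

```typescript
// Sketch: minutes of free time inside a window given existing bookings.
// Intervals are [start, end) in epoch minutes; logic is illustrative.
type Interval = [number, number];

function freeMinutes(window: Interval, booked: Interval[]): number {
  const [ws, we] = window;
  const busy = booked
    .map(([s, e]): Interval => [Math.max(s, ws), Math.min(e, we)]) // clip to window
    .filter(([s, e]) => e > s)                                     // drop non-overlapping
    .sort((a, b) => a[0] - b[0]);
  let free = 0;
  let cursor = ws;
  for (const [s, e] of busy) {
    if (s > cursor) free += s - cursor; // gap before this booking
    cursor = Math.max(cursor, e);      // merge overlapping bookings
  }
  if (cursor < we) free += we - cursor; // tail after the last booking
  return free;
}
```

Keeping this on the server side means the ranking in the transcript above is deterministic; the model spends its reasoning budget on the judgment call, not the subtraction.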

What's deliberately not in the server

Two absences worth calling out, because they're as much the design as the presences:

draft_customer_message does not send. It returns text. A tool that sends is one the agent cannot undo, and any responsible deployment needs a human-in-the-loop send step. The boundary is intentional.

flag_for_human is a tool, not a system-prompt instruction. I could have asked the model to refuse via prompting. Giving it an explicit escalation tool with a structured reason field is a more honest design — the refusal is visible in the tool-call log, the dispatcher gets a queue item they can act on, and the behavior is gated by something stronger than instruction-following.
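A minimal sketch of what that escalation tool could look like. The reason categories here are my own illustrative assumptions, not the repo's exact enum:

```typescript
// Sketch of an escalation tool with a structured reason field.
// Reason values are assumed categories for illustration.
type EscalationReason = "ambiguous_request" | "no_qualified_tech" | "policy_exception";

interface Escalation {
  job_id: string;
  reason: EscalationReason;
  note: string;
  created_at: string;
}

const reviewQueue: Escalation[] = [];

function flagForHuman(job_id: string, reason: EscalationReason, note: string): Escalation {
  const item: Escalation = { job_id, reason, note, created_at: new Date().toISOString() };
  reviewQueue.push(item); // visible in the tool-call log AND routable to a dispatcher
  return item;
}
```

Because the reason is an enum rather than free text, the dispatcher's queue can be filtered and counted, and a refusal leaves a structured trace instead of a shrug in prose.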

Try it yourself

The repo is MIT-licensed and runs locally. You'll need Bun; no API key required for the deterministic mode.

git clone https://github.com/midimurphdesigns/fieldops-mcp.git
cd fieldops-mcp
bun install

# Deterministic smoke — no API key, ~2 seconds
bun run smoke

That run spawns the MCP server, connects an MCP client to it over stdio, and walks all six tools end-to-end.

To actually talk to the server from Claude Desktop, drop this into your MCP config and restart the app:

{
  "mcpServers": {
    "fieldops": {
      "command": "bun",
      "args": ["run", "/absolute/path/to/fieldops-mcp/src/server.ts"]
    }
  }
}

Then ask Claude things like "what urgent jobs are open this morning, and who's the best HVAC tech for J-2001 between 1 and 4 PM today?" and watch the tool calls happen. The full live transcript is in docs/TRANSCRIPTS.md.

How these skills transfer

Six tools wrapping a fictional dispatcher is the example; the skills are the thing. The same shapes address real bottlenecks:

  • Agents making up state instead of mutating it. Production agent integrations fail catastrophically when the model assumes its tool call succeeded silently. Mutation tools that return typed conflict errors ({ error: "conflict", message }) force the model to read a failure and route around it — the difference between an agent that double-books and an agent that asks a human.
  • Agents sending messages they shouldn't. "Drafts text, doesn't send" is a structural human-in-the-loop checkpoint, not a promise. Customer-facing AI in regulated industries lives or dies by that boundary; the tool surface is where you enforce it.
  • Agents refusing silently or guessing. An explicit flag_for_human tool with a structured reason field makes refusal visible in the tool-call log and queue-routable to a human reviewer. That's a feature regulated buyers actually pay for; it's also a way to gather training data for the next iteration.
  • Enterprise integrations with 50 thin wrappers. Most "AI on top of our API" projects fail because they expose 50 endpoints to the model and trust the model to pick. Six well-shaped tools — read / search-and-rank / mutate / compose / aggregate / escalate — beat 50 thin wrappers because the model doesn't have to reason about your schema. It can reason about the workflow.
  • AI-native products that need predictable agent behavior. MCP as a contract means the agent's capability is decided in your code, not the prompt. That's the difference between a feature you can ship and a feature you can demo.

This is the layer between "the agent" and "the system" — and it's where most of the FDE work in 2026 actually happens.

Why I built this one

A lot of the work I want to be doing more of in the next few years sits at the boundary between a real team's workflow and the agent that helps them run it. That work has names — Forward Deployed Engineer, Applied AI, Solutions Engineer at an AI-native company. The shared skill is the one I find most interesting in 2026: designing the tool surface itself. Picking which tools an agent gets, what they return, how their errors are shaped, where the human stays in the loop. Most of the agent capability you experience as a user is decided in those choices, before the model is even prompted.

fieldops-mcp is the second of three projects exploring that shape from different angles. fedbench gave me a way to talk about evaluation rigor. fieldops-mcp gives me a way to talk about tool-surface design. grant-pilot ties both together — a multi-turn agent that composes specialist sub-agents over real public data, with budget caps, rate limits, and structured-failure routing. They're siblings in my head and they're cross-linked in each other's READMEs.

If you're working on agent-shaped problems — your own product, your own team, inside Deloitte's AI practice, or at any of the AI-native companies building this kind of system — I'd genuinely enjoy a conversation. The version I find most useful is usually the smallest: one specific workflow, what tools you ended up exposing, what you almost shipped and pulled back. I'd rather swap notes on what's actually working than trade abstractions about agents in general.

Easiest way to reach me is the contact page on this site, or just connect on LinkedIn. The repo is at github.com/midimurphdesigns/fieldops-mcp, the live showcase is at fieldops-mcp.kevinmurphywebdev.com, and the docs in there go deeper than this post — an architecture overview, a tool-design heuristics file, fixture conventions, and a transcript-capture protocol so the example runs in the README stay aligned with real server behavior.