May 15, 20268 minEngineering

Building Kev-O: a grounded RAG chatbot trained on my own writing

Kev-O answers questions about my work using only the public corpus I've written. Hybrid BM25 plus cross-encoder rerank, Claude streaming through the Vercel AI SDK, a daily USD cap so it can't blow up my API bill. Three surfaces, one brain. Here's what I built and what choices ended up mattering.

Repo: github.com/midimurphdesigns/kev-o-ai-search

Live: kev-o.kevinmurphywebdev.com, plus the ⌘K palette on every page of this site and the inline punch-ins at the foot of every blog post and project case study.

Companion artifacts: fedbench (RAG eval rigor), grant-pilot (multi-turn agent composition), and the mdx-corpus primitive this build extracted.

Kev-O is a grounded chatbot that answers questions about my work using only the public writing I've put on the record. Blog posts, project case studies, resume, About page, the READMEs of my open-source repos. He cites his receipts. He refuses to invent.

I built him because the standard portfolio chatbot is a tell. Most of them are GPT with a system prompt that says "you are Kevin's assistant," which produces confident-sounding garbage and gets less interesting every time you talk to it. I wanted the opposite: a surface where the interesting thing is what's actually in the corpus, and where the bot is structurally incapable of saying things I didn't write.

This post is what I built and what choices ended up mattering.

What's on screen

Three surfaces, all backed by the same retrieval pipeline:

The subdomain. kev-o.kevinmurphywebdev.com is a standalone full-page conversation, the URL a visitor can share.
The ⌘K palette. Every page on this site has a global command palette. Top of the panel is Kev-O. The site-search/jump-to-page list is secondary, deliberately below.
Inline punch-ins. At the foot of every curated blog post and project case study is an input that's already focused on the page you're reading. Ask about FedNow at the bottom of the FedNow case study and Kev-O grounds his answer in the page first.

All three call the same /api/kev-o endpoint on this site. The subdomain is a thin proxy. One brain, three surfaces.

The retrieval pipeline

query
  ↓
BM25 over the full corpus  →  top-20 candidates  (lexical, ~3ms, free)
  ↓
Voyage rerank-2.5          →  top-6 winners      (semantic, ~120ms, ~$0.0002)
  ↓
Optional page-context passage at position 0 for inline punch-ins
  ↓
Claude Sonnet 4.6 streaming via Vercel AI SDK    (temp 0.4, max 800 tokens)
  ↓
text streams to the surface

The corpus is about 212 chunks across four sources: this site's MDX content, the resume JSON, the About page prose, and the live READMEs of five OSS repos fetched at build time.

Why hybrid, not pure embeddings

The dominant signal in the queries Kev-O sees is the literal vocabulary. People type "what did Kevin do on FedNow" or "is grant-pilot federal." Short, factual, domain-specific. BM25 from 1995 still beats dense-vector retrievers on this query shape because the words ARE the signal.

But BM25 is brittle on paraphrase. "The federal payments rail" should match "FedNow"; "the grant-finding agent" should match "grant-pilot." So the second stage is a cross-encoder rerank from Voyage's voyage-rerank-2.5, which re-orders the top-20 lexical candidates with full semantic awareness.

This is the same pipeline shape I use in fedbench, where I measured BM25-only vs hybrid vs pure-embedding on a hand-labelled federal-grants benchmark. Hybrid wins on recall@5 by a wide enough margin that it's worth the rerank cost. Pure embeddings lose on the short-query case because they over-smooth the lexical signal.

What I didn't do

I didn't ship vector embeddings of my own. Two reasons:

The corpus is small (212 chunks, ~120KB JSON). BM25 over 212 chunks is faster than the network round-trip would be to a vector store.
I had the rerank-only option. Voyage's reranker takes raw text candidates and a query and returns a relevance score. There's no embedding step on my side. That removes a moving piece and keeps the corpus build pipeline as one MDX → JSON command.

If the corpus were 10x larger this calculation flips. At that point an embedding index is worth standing up, but BM25 is the right first-stage filter regardless.

The voice problem

The model is Claude Sonnet 4.6. The default Sonnet voice is competent-but-bland. Kev-O needed personality without sliding into the chatbot-bit territory where every response opens with "Great question!"

The system prompt does three things to shape voice:

Anchors who's talking. Kev-O is described as a competent engineer Kevin asked to handle these conversations. Not "Kevin's AI assistant." Not "a language model." A voice with a stance.
Constrains length. Default to two short paragraphs. The visitor is evaluating, not reading an essay. If a question genuinely needs more, expand to three. Never lists of bullets unless asked.
Forces citations. Every passage gets a [1], [2] reference in the response. The references point to real URLs into my writing. If Kev-O wants to make a claim, he has to ground it in a passage he can show you.

The constraint that mattered most was the third one. Citations aren't decoration. They're how I get a chatbot to stop hallucinating. If the model can't ground a claim, it has to say so.

Rate limiting and cost cap

Three layers:

Per-IP sliding window. 50 requests per hour via Upstash Redis. Returns 429 with a Retry-After header. Plenty for a visitor to evaluate; tight enough that nobody scrapes the model for free.

Daily USD cap. Defaults to five dollars per UTC day. The cost is charged post-stream in the AI SDK's onFinish callback using the actual token counts the model reports, not an estimate. When the cap is hit, every request gets a friendly "napping until tomorrow" response with the seconds-until-midnight retry-after. The cap is the real safeguard. Per-IP limits protect against any one attacker; the USD cap protects against a coordinated swarm I never see.

Owner bypass. A separate /api/kev-o-admin route accepts ?key=<KEV_O_ADMIN_KEY> and drops an HttpOnly, SameSite=Strict, Secure cookie that's good for thirty days. The route fails closed if the env var isn't set or is under 24 characters, uses Node's timingSafeEqual for the comparison so there's no length-leak oracle, and is per-IP rate-limited at 5 requests per hour BEFORE the key check. That last bit is the one I'm proudest of. An attacker exhausts their guess budget regardless of whether they guessed right. Wrong keys return 404, not 401, so the route is indistinguishable from one that doesn't exist.

What I extracted along the way

The interesting build artifact wasn't the chatbot. It was a small npm primitive called mdx-corpus that I pulled out of the corpus build step. It takes a directory of MDX files and emits retrieval-ready JSON chunks: front-matter intact, headings preserved as chunk anchors, code fences kept whole. Three hundred lines of TypeScript, nineteen tests, tsup dual ESM/CJS build. The kind of thing I would have wanted to find before I had to write it.

That extraction is the part I'd most recommend. Building Kev-O didn't teach me much I didn't already know; pulling out the primitive forced me to write the API I would want to consume as a stranger. That's where the design judgment shows up.

What I'd change

Two things I'd revisit if I shipped this to a real product team:

Per-corpus prompt tuning. The voice prompt assumes the corpus is mine and the tone should match. For a multi-tenant version I'd lift the persona description out of the constant and into a build-time argument so the same retrieval pipeline can wear different voices.

Recall@k as a first-class signal. The eval suite scores on grounded behavior (does Kev-O cite the right URL, refuse off-topic, redirect on confidential probes). It doesn't yet quantify retrieval recall at k=6 against a hand-labelled gold set. That's the next eval tier. It would let me tune the BM25 candidate count and the rerank top-k against a number instead of a vibe.

Neither of those is hard. Both are the next move if this stops being a portfolio artifact and starts being something other people deploy.

What I did build for safety

The shipping question for a public chatbot grounded in your own writing isn't "does it hallucinate." It's "what happens when someone tries to make it embarrass you on Twitter." So:

A private eval suite runs against the live production endpoint after every deploy. Six categories: grounding (does it cite?), refusal (does it stay on-script?), hallucination (does it invent employers?), persona (does it survive a jailbreak?), prompt injection (can a page override its instructions?), and confidential probes (does it leak the existence of private projects?). Twenty-four prompts. Mostly deterministic matchers; Claude Haiku as judge for the open-ended ones. The suite catches regressions before a real user does.

The corpus build itself has a deny-list scan that fails the Vercel deploy if a private project name ever lands in MDX, resume.json, or a fetched README. Defending earlier in the pipeline is strictly better than defending later in the model.

The eval repo is private. The harness isn't sensitive; the test inputs are. If you publish your jailbreak probes, you've handed an attacker your threat model.

The thing this is actually proof of

Kev-O is a portfolio object. The point isn't that you should use it. The point is that I built it end-to-end (retrieval pipeline, prompt construction, streaming UI with character-by-character reveal, rate-limit infrastructure, owner-bypass with paranoid security posture), and the result is one URL visitors can click and immediately interact with. Read a case study, then ask the bot a follow-up at the bottom of the page. Same brain. Same voice.

If you're hiring for Applied AI or product engineering and you've gotten this far, ask Kev-O why I'd be good for the role. He's read the corpus.

ask kev-o