
Engineering · 6 min read

mdx-corpus: a tiny package that turns an MDX directory into a retrieval-ready corpus

While building Kev-O, I kept rewriting the same parse-and-chunk step for MDX content. So I pulled it out into a small npm package that does file-in / JSON-out and gets out of the way. No embeddings, no vector store, no LLM. Three responsibilities, sharp boundaries.

npm: npm install mdx-corpus

Repo: github.com/midimurphdesigns/mdx-corpus

Used in: Kev-O, and in this site's own corpus build.

Companion post: Building Kev-O (the build this got extracted from).

mdx-corpus is a small npm package that does the boring, load-bearing part of RAG over a directory of MDX files: parse the frontmatter, strip JSX while keeping the prose inside it, chunk on semantic boundaries, and emit clean passages with deep-link URLs and source metadata. Bring your own embeddings, your own vector store, your own LLM. The package is pure file-in / JSON-out.

I built it because I kept writing the same fifty lines of MDX-handling glue every time I wanted to retrieve over my own writing. It's the kind of thing I would have wanted to find when I went looking. Now I'm publishing it so the next person doesn't have to write it either.

What it does

Three responsibilities, in order:

  1. Walks one or more source directories and reads every .md / .mdx file. Frontmatter is parsed out and kept on the chunk; the body is normalized.
  2. Strips JSX components while preserving their children. A <Stat>3</Stat> becomes the text 3, not the literal string <Stat>3</Stat> and not an empty string. A <Callout>The point.</Callout> becomes The point. Embeddings stop being polluted by component names and prop syntax.
  3. Chunks on heading boundaries by default, falling back to paragraph-level splits when a section exceeds the token budget. Each chunk carries its heading so retrieval can cite "from /blog/x under section Y."

That's the whole package. Roughly three hundred lines of TypeScript, nineteen tests, a dual ESM/CJS build via tsup, and a single runtime dependency: a small frontmatter parser.
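
For concreteness, each emitted chunk looks roughly like this. The field names are my reconstruction from the description above, not the package's published types:

// Illustrative only — reconstructed from the post, not the package's real types.
type Chunk = {
  id: string;                            // stable chunk identifier
  text: string;                          // cleaned prose, JSX shells stripped
  heading: string | null;                // nearest heading, for section citations
  url: string;                           // deep link built from baseUrl + slug
  kind: string;                          // source kind, e.g. 'blog' or 'project'
  frontmatter: Record<string, unknown>;  // whitelisted keys: title, date, tags
};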

Why this is its own package

The temptation was to write the parse step inline in Kev-O and move on. Three reasons it earned a separate package:

It's reusable across my own portfolio. Kev-O retrieves over my blog and project content. This site's own search index could use the same chunks. A future newsletter archive could too. Three consumers, one source of truth.

The boundary is sharp. The package does parse-and-chunk and nothing else. No embeddings, no vector store, no retrieval logic, no LLM orchestration. Those are downstream concerns I want to make different choices about per consumer. A package that owned the whole pipeline would be a framework, and I'd be back to writing glue around it.

It's small enough to read. A stranger evaluating it can read the source in fifteen minutes. They'll see that it doesn't reach for unnecessary dependencies and that the test suite covers the gnarly cases (frontmatter with quote characters in values, fenced code blocks containing what looks like a heading, JSX with attributes spanning multiple lines).

What was tricky

Two surprises during the build that the README doesn't show.

JSX-with-children is harder than JSX-as-self-closing. A self-closing <Stat n="3" /> you can regex out cleanly. A <Callout title="Note">The body keeps **markdown** inside it.</Callout> needs to preserve the markdown body while dropping the component shell. The MDX AST handles this correctly, but I didn't want a heavy parser dependency. So the package does a small hand-rolled walk that tracks tag depth and only strips the opening and closing wrappers, leaving everything between them intact for the chunker.
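
To make the idea concrete, here's an approximation of that strip step. It is not the package's actual walk — the real one moves character by character and tracks depth — but an iterative-regex version of the same behavior, assuming well-formed JSX, capitalized component names, and no > characters inside attribute values:

// Not the package's actual implementation — an iterative-regex sketch of the
// same behavior. Assumes well-formed JSX and capitalized component names.
function stripJsxShells(source: string): string {
  // Drop self-closing components outright: <Stat n="3" />.
  // [^>] matches newlines too, so multi-line attributes are handled.
  let text = source.replace(/<[A-Z][\w.]*(\s[^>]*)?\/>/g, '');
  // Peel wrapper pairs, keeping the children ($3). Each pass removes one
  // level of nesting; loop until nothing changes.
  const wrapper = /<([A-Z][\w.]*)(\s[^>]*)?>([\s\S]*?)<\/\1>/g;
  let prev: string;
  do {
    prev = text;
    text = text.replace(wrapper, '$3');
  } while (text !== prev);
  return text;
}

Run on the <Callout> example above, it returns the body with its markdown intact.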

Heading-based chunking has a long-tail problem. A blog post with a 4000-word section under one heading produces one giant chunk that blows the token budget. The right answer was a hybrid: chunk on headings first, then for any chunk over the token limit, split on paragraph breaks until under budget. The chunk metadata still carries the original heading so citations remain accurate even after a long section gets split.
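
A sketch of that fallback. The chars-divided-by-four token estimate and the function name are mine, standing in for whatever the package really uses:

// Paragraph fallback for an over-budget heading chunk. The token estimate
// (chars / 4) is a crude stand-in, not the package's actual accounting.
function splitLongChunk(text: string, heading: string, maxTokens: number) {
  const estimate = (s: string) => Math.ceil(s.length / 4);
  if (estimate(text) <= maxTokens) return [{ heading, text }];

  const out: { heading: string; text: string }[] = [];
  let current = '';
  for (const para of text.split(/\n{2,}/)) {
    const candidate = current ? current + '\n\n' + para : para;
    if (current && estimate(candidate) > maxTokens) {
      // Flush and start a new chunk; the original heading rides along
      // so citations stay accurate after the split.
      out.push({ heading, text: current });
      current = para;
    } else {
      current = candidate; // a single over-budget paragraph still emits whole
    }
  }
  if (current) out.push({ heading, text: current });
  return out;
}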

What it does NOT do

By design, the package refuses to grow in three directions:

  • No embedding generation. Voyage, OpenAI, Cohere, a local model: your call, downstream of this package.
  • No vector storage. Pgvector, Pinecone, Turbopuffer, a JSON file with BM25 on top: also your call.
  • No retrieval logic. BM25, cosine similarity, hybrid reranking, cross-encoders: still your call.

This is the discipline that keeps it useful. Every refusal is a thing I don't have to maintain, version, or document. The package stays a sharp tool.

End-to-end example

import { buildCorpus } from 'mdx-corpus';
import { writeFile } from 'node:fs/promises';

const corpus = await buildCorpus({
  sources: [
    { dir: './content/blog',     baseUrl: '/blog',      kind: 'blog' },
    { dir: './content/projects', baseUrl: '/portfolio', kind: 'project' },
  ],
  chunkBy: 'heading',
  maxChunkTokens: 500,
  includeFrontmatter: ['title', 'date', 'tags'],
});

await writeFile('corpus.json', JSON.stringify(corpus));

You hand the output to your embedder, push the vectors into your store, retrieve at query time. The package is upstream of every interesting choice you get to make.
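
That downstream step might look like this — embed and store are placeholders for whichever provider and vector store you pick, and the chunk fields follow the illustrative shape sketched earlier:

import { readFile } from 'node:fs/promises';

// Placeholders for your own choices — swap in Voyage/OpenAI/local,
// pgvector/Pinecone/a JSON file. Not part of mdx-corpus.
declare function embed(text: string): Promise<number[]>;
declare const store: {
  upsert(row: { id: string; vector: number[]; metadata: unknown }): Promise<void>;
};

const chunks: { id: string; text: string }[] =
  JSON.parse(await readFile('corpus.json', 'utf8'));

for (const chunk of chunks) {
  const vector = await embed(chunk.text);
  await store.upsert({ id: chunk.id, vector, metadata: chunk });
}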

How Kev-O uses it

In Kev-O's case, the corpus is built at deploy time by an npm prebuild script that runs before next build. About two hundred chunks across four sources: this blog, the project case studies, my resume JSON, and the READMEs of five open-source repos pulled at build time. The resulting corpus.json ships in the deploy artifact and is read by the API route at request time. BM25 runs in-memory over those two hundred chunks in under five milliseconds. Voyage's rerank-2.5 narrows the top twenty candidates to the top six. Claude generates from there.
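
Sketched as code — the route itself isn't shown in this post, so these function names are placeholders for the three stages, not Kev-O's actual identifiers:

type Chunk = { text: string; url: string; heading: string | null };

// Placeholder signatures for the stages described above.
declare function bm25Search(query: string, k: number): Chunk[];                // in-memory BM25
declare function voyageRerank(q: string, docs: Chunk[], topK: number): Promise<Chunk[]>; // rerank-2.5
declare function askClaude(q: string, context: Chunk[]): Promise<string>;      // generation

async function answer(query: string): Promise<string> {
  const candidates = bm25Search(query, 20);              // <5 ms over ~200 chunks
  const top = await voyageRerank(query, candidates, 6);  // twenty candidates down to six
  return askClaude(query, top);
}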

The whole retrieval pipeline is maybe one hundred lines of code on top of mdx-corpus. That's the package doing its job: get the parse-and-chunk right so I can focus on the parts that actually matter to the visitor.

What I might add later

I'm deliberately keeping the API small until I have a real second consumer. The two things on the maybe-someday list:

  • A prune option that drops chunks under a minimum token count (right now every chunk is emitted and consumers filter). Probably the next thing I add the first time I rebuild against a corpus with a lot of one-paragraph posts.
  • An onChunk hook so consumers can attach computed fields (read time, language detection, tags inferred from content) at parse time instead of post-processing. Useful but I want to see two callers ask for it before I commit to the surface.

Most package-design failures I've watched come from adding surface area before the second consumer asks for it. So the package stays small until the second consumer is real.

The recommendation, if you're building RAG over your own writing

Use mdx-corpus if your corpus is MDX files in a Next.js / Astro / Remix repo. Don't use it if your corpus is raw markdown and you're not on the JSX side of the world. The whole reason this package exists is the JSX-component handling, which is what makes MDX content uniquely annoying to embed cleanly.

If mdx-corpus doesn't fit your shape, the relevant thing to copy is the chunking strategy: heading first, paragraph-fallback for long sections, metadata carried through so retrieval can cite the right URL. That's the load-bearing part.
