Open-source and personal projects · 2026

mdx-corpus — npm package: MDX directory to retrieval-ready corpus

Public npm package that turns a directory of MDX files into a retrieval-ready JSON corpus for RAG pipelines. It parses frontmatter, strips JSX components while preserving their text children, chunks on heading boundaries with a paragraph-split fallback for long sections, and emits passages carrying source URL and metadata for citation. ~300 lines of TypeScript, 19 tests, dual ESM/CJS build via tsup, zero runtime dependencies. Extracted from the kev-o-ai-search build because the parse-and-chunk step is reusable across surfaces; the package deliberately refuses to grow into embeddings, vector storage, or retrieval logic.

Author · Applied AI · TypeScript · tsup · vitest · MDX · npm

npm: npm install mdx-corpus

Repo: github.com/midimurphdesigns/mdx-corpus

Used in: Kev-O and this site's own corpus build.

Read the full story: Building mdx-corpus

A small npm package that turns a directory of MDX files into a retrieval-ready JSON corpus. Parses the frontmatter, strips JSX components while keeping their text content, chunks on heading boundaries with a paragraph-split fallback for long sections, and emits clean passages carrying source URL and metadata for citation. ~300 lines of TypeScript, 19 tests, dual ESM/CJS build, zero runtime dependencies. Pure file-in / JSON-out.

How it's built

TypeScript strict, tsup for the dual ESM/CJS build, vitest for the test suite. The package's three responsibilities are split across three small modules: parse.ts (frontmatter and JSX stripping), chunk.ts (heading-based chunking with character-budget fallback), and index.ts (the public buildCorpus API that walks a directory tree and composes the other two). JSX handling is hand-rolled instead of pulling in a full MDX AST: the package tracks tag depth and strips only the opening and closing wrappers, preserving everything between them so a <Callout>The point.</Callout> survives as The point. in the corpus.
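To make the tag-depth idea concrete, here is a minimal sketch of a hand-rolled stripper. It is not the package's actual parse.ts: the names stripJsx and findTagEnd are hypothetical, and it assumes that only capitalized tags count as components, so lowercase HTML tags are left alone.

```typescript
// Hypothetical sketch of depth-tracked JSX stripping; not the real parse.ts.
// Assumption: a "component" is any tag whose name starts with A-Z.

function isUpper(ch: string): boolean {
  return /[A-Z]/.test(ch);
}

// Find the index of the ">" that ends the tag starting at `start`.
// A ">" inside a quoted attribute or a {expression} does not count.
function findTagEnd(source: string, start: number): number {
  let quote: string | null = null;
  let braces = 0;
  for (let i = start + 1; i < source.length; i++) {
    const ch = source.charAt(i);
    if (quote !== null) {
      if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === "{") {
      braces += 1;
    } else if (ch === "}") {
      braces -= 1;
    } else if (ch === ">" && braces === 0) {
      return i;
    }
  }
  return -1; // unterminated tag
}

function stripJsx(source: string): string {
  let out = "";
  let depth = 0; // number of component wrappers currently open
  let i = 0;

  while (i < source.length) {
    const opens = source.charAt(i) === "<" && isUpper(source.charAt(i + 1));
    // A closing tag is only treated as a wrapper if a component is open.
    const closes =
      depth > 0 && source.startsWith("</", i) && isUpper(source.charAt(i + 2));

    if (!opens && !closes) {
      out += source.charAt(i);
      i += 1;
      continue;
    }

    const end = findTagEnd(source, i);
    if (end === -1) {
      out += source.slice(i); // unterminated: keep the rest as plain text
      break;
    }

    const selfClosing = source.charAt(end - 1) === "/";
    if (closes) depth -= 1;
    else if (!selfClosing) depth += 1;
    i = end + 1; // drop the tag itself, keep whatever follows
  }
  return out;
}
```

Under these assumptions, stripJsx("<Callout>The point.</Callout>") returns "The point.", and an attribute spread over several lines is skipped along with the rest of its opening tag. The sketch does not protect code fences, so a component-looking tag inside a fenced block would also be stripped; a fuller version would need to skip fenced regions first.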

What this is NOT

By design, the package refuses three jobs that would balloon its surface: no embedding generation, no vector storage, no retrieval logic. Voyage, OpenAI, Cohere, pgvector, Pinecone, BM25, cosine, hybrid rerank: all downstream of this package. Every refusal is a thing the package doesn't have to maintain, version, or document. Restraint is the design.

Artifacts worth reading

The chunking strategy. Heading-first, paragraph-fallback for long sections, metadata carried through so retrieval can cite the right URL even after a section gets split.
The JSX-stripping walker. Small hand-rolled implementation that handles components-with-children correctly without a heavy parser dependency.
The test suite. Covers the gnarly cases: frontmatter with quote characters in values, code fences containing what looks like a heading, JSX attributes spanning multiple lines.
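The chunking strategy and the code-fence edge case above can be sketched together. This is an illustration only, not the package's chunk.ts: the Passage shape, the function names chunkByHeading and splitToBudget, and the 1200-character default budget are all assumptions.

```typescript
// Hypothetical sketch of heading-first chunking with a paragraph fallback.
// The Passage shape and the default budget are assumptions, not the real API.

interface Passage {
  heading: string;
  text: string;
  url: string;
  metadata: Record<string, unknown>;
}

// Greedily pack paragraphs into pieces of at most maxChars characters.
// A single paragraph longer than the budget is kept whole.
function splitToBudget(text: string, maxChars: number): string[] {
  if (text.length <= maxChars) return [text];
  const pieces: string[] = [];
  let buffer = "";
  for (const paragraph of text.split(/\n\s*\n/)) {
    const candidate = buffer === "" ? paragraph : `${buffer}\n\n${paragraph}`;
    if (candidate.length > maxChars && buffer !== "") {
      pieces.push(buffer);
      buffer = paragraph;
    } else {
      buffer = candidate;
    }
  }
  if (buffer !== "") pieces.push(buffer);
  return pieces;
}

function chunkByHeading(
  body: string,
  url: string,
  metadata: Record<string, unknown>,
  maxChars = 1200, // assumed budget
): Passage[] {
  let current = { heading: "", lines: [] as string[] };
  const sections = [current];
  let inFence = false;

  for (const line of body.split("\n")) {
    // Toggle on fence markers so "# ..." inside a code block is not a heading.
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^#{1,6}\s+(.+)$/.exec(line);
    if (match) {
      current = { heading: (match[1] ?? "").trim(), lines: [] };
      sections.push(current);
    } else {
      current.lines.push(line);
    }
  }

  const passages: Passage[] = [];
  for (const section of sections) {
    const text = section.lines.join("\n").trim();
    if (text === "") continue;
    // Every piece of a split section carries the same heading, URL, and metadata.
    for (const piece of splitToBudget(text, maxChars)) {
      passages.push({ heading: section.heading, text: piece, url, metadata });
    }
  }
  return passages;
}
```

Because each piece is stamped with its section's heading, URL, and metadata, a retrieval layer can still cite the right page after a long section is split. One known gap in this sketch: the paragraph fallback splits on blank lines without checking fences, so a long section containing a code block with blank lines could be cut mid-block.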

The trade-offs

The package could grow to own embedding, retrieval, even reranking. It chooses not to. The benefit is sharp boundaries: a stranger evaluating the source can read it in fifteen minutes and see exactly what they're buying. The cost is that you write a little more glue in your application code to wire embeddings and retrieval on top. That trade is correct for the role: this is a sharp tool, not a framework, and most package-design failures come from adding surface before the second consumer asks.
