RAG Text Chunker
Split text into token-sized chunks for RAG / embeddings prep. Multiple strategies: recursive char, sentence-aware, semantic boundaries. Configurable overlap. All in-browser.
What is this for?
Retrieval-Augmented Generation (RAG) and embedding-based search both depend on splitting a corpus into chunks: small pieces that are individually embedded and stored in a vector database. The split happens before any of the AI machinery runs, but retrieval quality quietly depends on it more than most people realise. Too-small chunks lose context; too-large chunks dilute relevance; chunks split mid-sentence retrieve poorly because the embedding lands in a weird semantic spot. This tool gives you a fast in-browser playground to experiment with chunk size, overlap, and strategy before you commit your pipeline to a particular configuration.
The four strategies
- recursive-char (default). Tries to split on paragraph breaks first, then line breaks, then sentence boundaries (`.`), then spaces. Each piece is greedily packed up to the chunk-size budget, and it falls back to a hard character split only when nothing else fits. This is the default in LangChain, LlamaIndex, and most production RAG stacks, and a sensible starting point for almost any prose corpus (a sketch follows this list).
- sentence. Splits at sentence boundaries and never inside a sentence: even if a sentence is longer than the budget, it stays whole. Better than recursive-char when chunks will be quoted back as evidence (QA, citations), because no chunk ends mid-thought. Chunk sizes are more variable, since sentence lengths are.
- paragraph. Treats each paragraph as a unit. Two adjacent short paragraphs may be packed together, but a long paragraph is never split. Useful for documentation, knowledge-base articles, and well-formatted long-form where each paragraph is a coherent thought.
- semantic. Adds heading detection on top of paragraph: lines starting with `#`, `##`, etc., or short all-caps lines, are treated as new-section breaks. Good for technical documentation where section boundaries matter more than visual paragraph spacing.
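A minimal sketch of the recursive-char strategy in TypeScript, using the chars / 3.8 heuristic described under Token estimation below. The separator list and function names are illustrative, not the tool's actual code:

```ts
// Recursive character splitting: try coarse separators first, recurse to finer
// ones only for pieces that still exceed the budget. Names are illustrative.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3.8); // heuristic; see Token estimation below
}

function recursiveSplit(text: string, maxTokens: number, seps: string[] = SEPARATORS): string[] {
  if (estimateTokens(text) <= maxTokens) return [text];
  const [sep, ...rest] = seps;
  if (sep === undefined) {
    // No separators left: hard character split as a last resort.
    const step = Math.floor(maxTokens * 3.8);
    const out: string[] = [];
    for (let i = 0; i < text.length; i += step) out.push(text.slice(i, i + step));
    return out;
  }
  const pieces = text.split(sep).filter(p => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (estimateTokens(candidate) <= maxTokens) {
      current = candidate; // still under budget: keep packing greedily
    } else {
      if (current) chunks.push(current);
      current = "";
      if (estimateTokens(piece) > maxTokens) {
        // A single piece is itself over budget: recurse with the next, finer separator.
        chunks.push(...recursiveSplit(piece, maxTokens, rest));
      } else {
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Conceptually, the sentence and paragraph strategies are the same greedy packing with a restricted separator list and no hard-split fallback.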
Overlap — why and how much
- Why. If a relevant span happens to fall exactly across the boundary between two chunks, both chunks retrieve poorly because each holds only half of it. Overlap copies the last N tokens of the previous chunk onto the front of the next chunk, giving both a chance to capture the whole span (sketched after this list).
- How much. 10–20% of the chunk size is the common rule of thumb. For 500-token chunks, 50–100 tokens of overlap. The default here is 50, which works well in practice.
- Trade-off. More overlap = more total embeddings (more storage, more cost, slower index build, more retrieval candidates). Don't push past 30% unless you have a specific reason; you'll waste money without improving retrieval much.
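A minimal sketch of the copy-the-tail mechanism, assuming the same character-based token heuristic; applyOverlap() is a hypothetical helper, not part of the tool:

```ts
// Copy the last `overlapTokens` worth of text from the previous chunk onto the
// front of the next one, so spans near a boundary appear in both chunks.
function applyOverlap(chunks: string[], overlapTokens: number): string[] {
  const overlapChars = Math.round(overlapTokens * 3.8); // invert the chars/3.8 heuristic
  return chunks.map((chunk, i) =>
    i === 0 ? chunk : chunks[i - 1].slice(-overlapChars) + " " + chunk
  );
}

// 500-token chunks with the default 50-token (10%) overlap:
// const chunks = applyOverlap(recursiveSplit(text, 500), 50);
```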
Token estimation
- It's a heuristic. We use chars / 3.8 for English-ish text and 1 token per character for CJK (a sketch of the estimator follows this list). That's good to ±5–10% for prose, worse for code-heavy or structured text.
- Why not the real tokenizer? tiktoken is ~1 MB of WASM, far too heavy to ship just for chunk-sizing. If your downstream model has a hard token limit on chunks (some do, e.g. cohere/embed-v3 is capped at 512), size your chunks with a small safety margin or run the real tokenizer in your pipeline.
- This is fine for sizing. The exact token count of each chunk doesn't matter much for RAG quality — the embedding model truncates oversized inputs anyway. What matters is consistency: same heuristic for chunking and for budgeting downstream prompts.
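For reference, a sketch of the kind of estimator described above; the CJK character ranges here are an approximation, not the page's exact code:

```ts
// Rough token estimate: ~3.8 characters per token for English-like text,
// 1 token per character for CJK. The Unicode ranges are approximate.
const CJK_CHARS = /[\u3000-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF66-\uFF9F]/g;

function estimateTokens(text: string): number {
  const cjk = (text.match(CJK_CHARS) ?? []).length;
  const other = text.length - cjk;
  return Math.ceil(other / 3.8) + cjk;
}

// estimateTokens("Retrieval-Augmented Generation") -> 8  (30 chars / 3.8, rounded up)
```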
Common gotchas
- Don't chunk structured data like JSON or CSV with a text chunker. The chunk boundaries will fall in the middle of records and the embeddings will be meaningless. Either split on record boundaries first (see the sketch at the end of this section), or use a structure-aware RAG approach.
- Don't chunk code with a prose chunker. Function boundaries are what matter for code retrieval, not character counts. Tree-sitter-based chunkers exist for this.
- Whitespace and BOM. Pasted text can carry hidden whitespace or a byte-order mark that throws off token estimates. Trim and normalise before pasting if it matters.
- Privacy. Everything runs in the browser. Documents pasted here never leave the page; you can use this for confidential or PII-containing material the same way you'd use a local script.
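If you do need to chunk CSV-like data, here is a hedged sketch of the record-boundary approach mentioned above; chunkCsvRows() is a hypothetical helper, and it ignores quoting and multi-line fields:

```ts
// Pack whole CSV rows into chunks instead of splitting on characters, and
// repeat the header row so every chunk stays self-describing.
function chunkCsvRows(csv: string, maxTokens: number): string[] {
  const [header, ...rows] = csv.trim().split("\n");
  const tokens = (s: string) => Math.ceil(s.length / 3.8); // same heuristic as above
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const row of rows) {
    const rowTokens = tokens(row);
    if (current.length > 0 && currentTokens + rowTokens > maxTokens) {
      chunks.push([header, ...current].join("\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(row);
    currentTokens += rowTokens;
  }
  if (current.length > 0) chunks.push([header, ...current].join("\n"));
  return chunks;
}
```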