RAG Text Chunker
Split text into token-sized chunks for RAG / embeddings prep. Multiple strategies: recursive char, sentence-aware, semantic boundaries. Configurable overlap. All in-browser.
What is this for?
Retrieval-Augmented Generation (RAG) and embedding-based search both depend on splitting a corpus into chunks: small pieces that are individually embedded and stored in a vector database. The split happens before any of the AI machinery runs, but retrieval quality quietly depends on it more than most people realise. Too-small chunks lose context; too-large chunks dilute relevance; chunks split mid-sentence retrieve poorly because the embedding lands in a weird semantic spot. This tool gives you a fast in-browser playground to experiment with chunk size, overlap, and strategy before you commit your pipeline to a particular configuration.
The four strategies
- recursive-char (default). Tries to split on paragraph breaks first, then line breaks, then sentence boundaries (`.`), then spaces. Each piece is greedily packed up to the chunk-size budget, and it falls back to a hard character split only when nothing else fits. This is the default in LangChain, LlamaIndex, and most production RAG stacks, and a sensible starting point for almost any prose corpus (a sketch follows this list).
- sentence. Splits at sentence boundaries and never inside a sentence: even if a sentence is longer than the budget, it stays whole. Better than recursive-char when chunks will be quoted back as evidence (QA, citations), because no chunk ends mid-thought. Chunk sizes are more variable, since sentence lengths are.
- paragraph. Treats each paragraph as a unit. Two adjacent short paragraphs may be packed together, but a long paragraph is never split. Useful for documentation, knowledge-base articles, and well-formatted long-form where each paragraph is a coherent thought.
- semantic. Adds heading detection on top of paragraph: lines starting with `#`, `##`, etc., or short all-caps lines, are treated as new-section breaks. Good for technical documentation where section boundaries matter more than visual paragraph spacing.
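A minimal sketch of the recursive-char strategy in TypeScript, using the chars / 3.8 heuristic described under Token estimation below. The separator list and function names are illustrative, not the tool's actual code:

```ts
// Recursive character splitting: try coarse separators first, recurse to finer
// ones only for pieces that still exceed the budget. Names are illustrative.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3.8); // heuristic; see Token estimation below
}

function recursiveSplit(text: string, maxTokens: number, seps: string[] = SEPARATORS): string[] {
  if (estimateTokens(text) <= maxTokens) return [text];
  const [sep, ...rest] = seps;
  if (sep === undefined) {
    // No separators left: hard character split as a last resort.
    const step = Math.floor(maxTokens * 3.8);
    const out: string[] = [];
    for (let i = 0; i < text.length; i += step) out.push(text.slice(i, i + step));
    return out;
  }
  const pieces = text.split(sep).filter(p => p.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (estimateTokens(candidate) <= maxTokens) {
      current = candidate; // still under budget: keep packing greedily
    } else {
      if (current) chunks.push(current);
      current = "";
      if (estimateTokens(piece) > maxTokens) {
        // A single piece is itself over budget: recurse with the next, finer separator.
        chunks.push(...recursiveSplit(piece, maxTokens, rest));
      } else {
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Conceptually, the sentence and paragraph strategies are the same greedy packing with a restricted separator list and no hard-split fallback.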
Overlap — why and how much
- Why. If a relevant span happens to fall exactly across the boundary between two chunks, both chunks retrieve poorly because each holds only half of it. Overlap copies the last N tokens of the previous chunk onto the front of the next chunk, giving both a chance to capture the whole span (sketched after this list).
- How much. 10–20% of the chunk size is the common rule of thumb. For 500-token chunks, 50–100 tokens of overlap. The default here is 50, which works well in practice.
- Trade-off. More overlap = more total embeddings (more storage, more cost, slower index build, more retrieval candidates). Don't push past 30% unless you have a specific reason; you'll waste money without improving retrieval much.
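A minimal sketch of the copy-the-tail mechanism, assuming the same character-based token heuristic; applyOverlap() is a hypothetical helper, not part of the tool:

```ts
// Copy the last `overlapTokens` worth of text from the previous chunk onto the
// front of the next one, so spans near a boundary appear in both chunks.
function applyOverlap(chunks: string[], overlapTokens: number): string[] {
  const overlapChars = Math.round(overlapTokens * 3.8); // invert the chars/3.8 heuristic
  return chunks.map((chunk, i) =>
    i === 0 ? chunk : chunks[i - 1].slice(-overlapChars) + " " + chunk
  );
}

// 500-token chunks with the default 50-token (10%) overlap:
// const chunks = applyOverlap(recursiveSplit(text, 500), 50);
```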
Token estimation
- It's a heuristic. We use chars / 3.8 for English-ish text and 1 token per character for CJK (a sketch of the estimator follows this list). That's good to ±5–10% for prose, worse for code-heavy or structured text.
- Why not the real tokenizer? tiktoken is ~1 MB of WASM, far too heavy to ship just for chunk-sizing. If your downstream model has a hard token limit on chunks (some do, e.g. cohere/embed-v3 is capped at 512), size your chunks with a small safety margin or run the real tokenizer in your pipeline.
- This is fine for sizing. The exact token count of each chunk doesn't matter much for RAG quality — the embedding model truncates oversized inputs anyway. What matters is consistency: same heuristic for chunking and for budgeting downstream prompts.
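For reference, a sketch of the kind of estimator described above; the CJK character ranges here are an approximation, not the page's exact code:

```ts
// Rough token estimate: ~3.8 characters per token for English-like text,
// 1 token per character for CJK. The Unicode ranges are approximate.
const CJK_CHARS = /[\u3000-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF66-\uFF9F]/g;

function estimateTokens(text: string): number {
  const cjk = (text.match(CJK_CHARS) ?? []).length;
  const other = text.length - cjk;
  return Math.ceil(other / 3.8) + cjk;
}

// estimateTokens("Retrieval-Augmented Generation") -> 8  (30 chars / 3.8, rounded up)
```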
Common gotchas
- Don't chunk structured data like JSON or CSV with a text chunker. The chunk boundaries will fall in the middle of records and the embeddings will be meaningless. Either split on record boundaries first (see the sketch at the end of this section), or use a structure-aware RAG approach.
- Don't chunk code with a prose chunker. Function boundaries are what matter for code retrieval, not character counts. Tree-sitter-based chunkers exist for this.
- Whitespace and BOM. Pasted text can carry hidden whitespace or a byte-order mark that throws off token estimates. Trim and normalise before pasting if it matters.
- Privacy. Everything runs in the browser. Documents pasted here never leave the page; you can use this for confidential or PII-containing material the same way you'd use a local script.
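If you do need to chunk CSV-like data, here is a hedged sketch of the record-boundary approach mentioned above; chunkCsvRows() is a hypothetical helper, and it ignores quoting and multi-line fields:

```ts
// Pack whole CSV rows into chunks instead of splitting on characters, and
// repeat the header row so every chunk stays self-describing.
function chunkCsvRows(csv: string, maxTokens: number): string[] {
  const [header, ...rows] = csv.trim().split("\n");
  const tokens = (s: string) => Math.ceil(s.length / 3.8); // same heuristic as above
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const row of rows) {
    const rowTokens = tokens(row);
    if (current.length > 0 && currentTokens + rowTokens > maxTokens) {
      chunks.push([header, ...current].join("\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(row);
    currentTokens += rowTokens;
  }
  if (current.length > 0) chunks.push([header, ...current].join("\n"));
  return chunks;
}
```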