RAG Text Chunker

Split text into token-sized chunks for RAG and embeddings prep. Several strategies: recursive char, sentence-aware, semantic boundaries. Configurable overlap. Everything runs in the browser.

Paste your document

RAG (Retrieval-Augmented Generation) is a technique that pairs an LLM with an external knowledge base. Instead of relying only on the model's parametric memory, the system retrieves relevant passages at query time and concatenates them into the prompt.

The retrieval step depends on a vector store. Documents are split into chunks, each chunk gets an embedding vector, and those vectors are stored in a database that supports approximate nearest-neighbour search.

# Chunk size matters

Chunks that are too small lose context — a single sentence might not contain enough information to answer the question. Chunks that are too large dilute relevance — the embedding becomes an average of many unrelated topics, and retrieval gets noisy.

A common starting point is 500 tokens per chunk with 50 tokens of overlap. Overlap lets a chunk see a bit of context from its neighbours, which helps when a relevant span happens to fall across a chunk boundary.

# Strategy choice

Recursive character splitting is the most general: it tries paragraph breaks first, then line breaks, then sentence boundaries, then spaces. This is the default in most frameworks and works well for prose.

Sentence-aware splitting respects sentence boundaries and never breaks mid-sentence. It produces cleaner chunks for QA over articles and books, at the cost of more variable chunk sizes.

Paragraph chunking treats each paragraph as its own chunk (subject to the size limit). This is useful for structured documents like documentation or knowledge-base articles where each section is already a coherent unit.

Semantic chunking goes one step further: it tries to keep semantically related content together by respecting headings and topic boundaries, even when those don't align with paragraph breaks.

What is this for?

RAG (Retrieval-Augmented Generation) and embedding-based search both depend on splitting a corpus into chunks: small pieces that are embedded individually and stored in a vector database. The split happens before any AI machinery runs, yet retrieval quality quietly depends on it more than most people realise. Chunks that are too small lose context; chunks that are too large dilute relevance; chunks split mid-sentence retrieve poorly because the embedding lands in an odd semantic spot. This tool is a fast in-browser playground for experimenting with chunk size, overlap, and strategy before you commit a pipeline to one choice.

Four strategies

recursive-char (default). Tries to split on paragraph breaks first, then line breaks, then sentence boundaries (. ), then spaces. The resulting pieces are packed greedily up to the chunk-size budget. It falls back to a hard character split only when a piece still does not fit after every separator has been tried. This is the usual default in LangChain and many production RAG stacks, and a sensible starting point for almost any prose corpus.
sentence. Splits on sentence boundaries and never inside a sentence: even when a sentence is longer than the budget, it stays whole. Better than recursive-char when chunks will be quoted as evidence (QA, citations), because no chunk ends mid-thought. Chunk sizes are more variable, because sentence lengths are.
paragraph. Treats each paragraph as a unit. Two adjacent short paragraphs can be packed together, but a long paragraph is never split. Useful for documentation, knowledge-base articles, and well-formatted long-form text where each paragraph is one coherent thought.
semantic. Adds heading detection on top of paragraph: lines starting with #, ## and so on, or short all-caps lines, are treated as new-section breaks. Good for technical documentation where section boundaries matter more than visual paragraph spacing.
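The recursive-char fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `split_recursive` and the `max_chars` parameter are hypothetical, and the budget is expressed in characters for simplicity (a token budget could be converted with the chars / 3.8 estimate).

```python
# Separators in priority order: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text: str, max_chars: int, separators=None) -> list[str]:
    """Split text into chunks of at most max_chars characters (hypothetical helper)."""
    if separators is None:
        separators = SEPARATORS
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return split_recursive(text, max_chars, rest)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate          # greedy packing: keep filling the chunk
            continue
        if current:
            chunks.append(current)       # the separator at this boundary is dropped
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece alone is too big: recurse with the finer separators.
            chunks.extend(split_recursive(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "aaaa bbbb\n\ncccc dddd eeee"
    print(split_recursive(sample, max_chars=10))
```

One design note: because the separator is discarded wherever a chunk boundary falls, a chunk that ends on a sentence boundary loses its trailing full stop in this sketch. A fuller implementation would probably keep it.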

Overlap: why and how much

Why. If a relevant span falls exactly on the boundary between two chunks, both chunks retrieve poorly because each holds only half of it. Overlap copies the last N tokens of the previous chunk onto the front of the next one, giving both a chance to capture the span.
How much. A common rule of thumb is 10–20% of the chunk size. For 500-token chunks, that is 50–100 tokens of overlap. The default here is 50, which works well in practice.
Trade-off. More overlap means more total embeddings (more storage, more cost, a slower index build, more retrieval candidates). Don't push past 30% without a specific reason; you will spend money without improving retrieval much.
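To make the mechanism and the cost trade-off concrete, here is a small sketch. It assumes whitespace-separated words as a stand-in for tokens, and both function names (`add_overlap`, `chunk_count`) are hypothetical, chosen for illustration only.

```python
def add_overlap(chunks: list[str], overlap_words: int) -> list[str]:
    """Prefix each chunk (except the first) with the tail of the previous one."""
    if overlap_words <= 0:
        return list(chunks)
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        # Words stand in for tokens here; the tail comes from the ORIGINAL
        # previous chunk, so overlap does not accumulate across chunks.
        tail = chunks[i - 1].split()[-overlap_words:]
        out.append(" ".join(tail + [chunk]))
    return out


def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Approximate number of chunks when each chunk advances by size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    # Ceiling division without floats.
    return max(1, -(-(total_tokens - overlap) // stride))


if __name__ == "__main__":
    print(add_overlap(["a b c d", "e f g"], 2))
    for overlap in (0, 50, 150):
        print(overlap, chunk_count(10_000, 500, overlap))
```

For a 10,000-token document with 500-token chunks, this estimate gives 20 chunks with no overlap, 23 with 50 tokens of overlap, and 29 with 150, which is where the extra storage and embedding cost comes from.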

Token estimation

This is a heuristic. We use chars / 3.8 for English-ish text and 1 token per character for CJK. For prose this is accurate to within ±5–10%; for code-heavy or structured text it is worse.
Why not a real tokenizer? tiktoken is roughly 1 MB of WASM, far too heavy to ship just for chunk sizing. If your downstream model has a hard token limit on chunks (some do; for example, cohere/embed-v3 is capped at 512), build in a small safety margin or run a real tokenizer in your pipeline.
This is fine for sizing. The exact token count of each chunk does not matter much for RAG quality; many embedding models truncate oversized inputs anyway. What matters is consistency: use the same heuristic for chunking and for budgeting downstream prompts.
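The heuristic above is simple enough to sketch in a few lines. The exact Unicode ranges counted as CJK and the rounding rule are assumptions made for this example; the tool may draw those lines differently.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK test; the ranges below are an assumption for illustration."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF    # CJK Extension A
        or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF    # Hangul syllables
    )


def estimate_tokens(text: str) -> int:
    """Estimate tokens: 1 per CJK character, chars / 3.8 for everything else."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return cjk + round(other / 3.8)


if __name__ == "__main__":
    print(estimate_tokens("Chunks that are too small lose context."))
    print(estimate_tokens("你好"))
```

If a hard limit such as 512 tokens applies downstream, one option is to target a lower budget (say 10% under the limit) when sizing with an estimate like this.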

Common gotchas

Don't chunk structured data such as JSON or CSV with a text chunker. Chunk boundaries will fall in the middle of records and the embeddings will be meaningless. Split on record boundaries first, or use a RAG approach built for that kind of data.
Don't chunk code with a prose chunker. For code retrieval, function boundaries matter, not character counts. Tree-sitter-based chunkers exist for this.
Whitespace and BOM. Pasted text can carry hidden whitespace that throws off token estimates. If that matters, trim and normalise before pasting.
Privacy. Everything runs in the browser. Documents pasted here never leave the page, so you can use it for confidential or PII-containing material just as you would a local script.
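For the whitespace and BOM gotcha above, a pre-paste clean-up might look like the sketch below. The function name `normalise_text` and the specific rules (which characters to replace, how many blank lines to keep) are assumptions for illustration; adjust them to your corpus.

```python
import re
import unicodedata


def normalise_text(raw: str) -> str:
    """Strip BOMs and tidy whitespace before chunking (hypothetical helper)."""
    text = raw.replace("\ufeff", "")                        # byte-order mark
    text = unicodedata.normalize("NFC", text)               # canonical composition
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # unify line endings
    text = text.replace("\u00a0", " ")                      # non-breaking space
    text = re.sub(r"[ \t]+", " ", text)                     # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)                    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                  # keep at most one blank line
    return text.strip()


if __name__ == "__main__":
    messy = "\ufeffHello \u00a0 world\r\n\r\n\r\n\r\nNext  "
    print(repr(normalise_text(messy)))
```

Paragraph breaks (a single blank line) are preserved on purpose, since the paragraph-based strategies rely on them.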