Recursive Text Chunker (RAG)
Split long documents into overlapping chunks for RAG (Retrieval-Augmented Generation). Implements LangChain-style recursive character splitting: try paragraph break → sentence → word → character, preferring natural boundaries. Configurable chunk size, overlap, and separator list.
How to use the Recursive Text Chunker (RAG)
Paste a document. Pick a chunk size (500-1500 chars is typical for RAG; smaller chunks give more precise retrieval but leave more chunks to manage) and an overlap (50-100 chars helps preserve context across boundaries). The recursive splitter prefers paragraph breaks over sentence breaks over word breaks, so chunks end at natural boundaries where possible.
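The separator fallback described above can be sketched in a few lines of Python. This is a minimal illustration, not this tool's actual code and not LangChain's implementation: the function name recursive_split, the default separator list, and the choice to keep each separator attached to the piece before it are all assumptions made for the example. Overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", ". ", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first (paragraph break), then falls back
    to finer ones (sentence, word) and finally to a hard character cut.
    Illustrative sketch only: names and defaults are assumptions.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    # Keep the separator attached to the piece it followed, so joining
    # the chunks reproduces the input exactly.
    parts = [p + sep for p in pieces[:-1]] + [pieces[-1]]

    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # still fits: keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = ("Para one is short.\n\n"
       "Para two is a bit longer. It has two sentences.\n\n"
       "Para three.")
for chunk in recursive_split(doc, chunk_size=40):
    print(repr(chunk))
```

Because separators stay attached, chunks can end in whitespace; a caller would typically strip each chunk before embedding it. One simplification to be aware of: after a long part is split recursively, its final fragment is not merged with the text that follows, so some chunks come out smaller than they could be.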
Chunk size strategy
Optimal chunk size depends on your retrieval pattern. For embedding-based retrieval where you feed the top 3 chunks to a model with a small context window, 300-500 chars keeps each chunk semantically focused. For passing only the top-1 chunk to a long-context model, 1500-2500 chars gives more context per match.
Overlap is insurance against information being cut off at a chunk boundary. 10-20% of chunk size is typical. Too much overlap inflates index size and increases the chance of retrieving redundant chunks.
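To see how overlap inflates the index, it helps to count chunks for a fixed-size sliding window. The sketch below is an assumption-laden simplification: it ignores natural boundaries entirely, and the name sliding_window_chunks is made up for this example. Each new chunk starts chunk_size - overlap characters after the previous one, so a text of N characters yields roughly ceil((N - overlap) / (chunk_size - overlap)) chunks.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Fixed-size chunks where each one repeats the last `overlap`
    characters of the previous chunk. Illustrative sketch only."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Synthetic 10,000-character document, chunk size 1000.
document = "".join(str(i % 10) for i in range(10_000))
for overlap in (0, 100, 200, 500):
    count = len(sliding_window_chunks(document, 1000, overlap))
    print(f"overlap={overlap:>3}  chunks={count}")
```

For this 10,000-character input, the chunk count goes from 10 with no overlap to 11 at 10%, 13 at 20%, and 19 at 50%. The 10-20% range adds modest storage cost, while 50% overlap nearly doubles the number of chunks to embed and store.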