Splitting Techniques for RAG: The Art of the Right Chunk
In our last post, we built a RAG pipeline. One step we glossed over, how we split our documents into chunks, turns out to be the most critical part of the entire RAG system.
The fundamental problem: Bad Chunks = Bad Retrieval = Bad Answer
Think about it: the retriever's only job is to find the most relevant chunks. If our chunks are bad, the retriever will fail, and the LLM will give a bad answer.
Imagine a textbook where every paragraph is cut in half and randomly stitched to another half-paragraph. Finding a coherent answer would be impossible, no matter how smart the student is.
The quality of your chunks determines the quality of your retrieval, which determines the quality of your final answer. A bad splitting strategy will poison your entire pipeline.
Technique 1: The naive approach (Fixed-size splitting)
The simplest way to chunk a document is to just count a fixed number of characters (e.g., 150) and then split. We can also add an "overlap" to repeat a few characters, hoping to keep some context.
Let's see why this is often a bad idea.
# Assumes the langchain-text-splitters package is installed.
from langchain_text_splitters import CharacterTextSplitter

# This splitter just counts characters.
# It doesn't understand words or sentences.
fixed_splitter = CharacterTextSplitter(
    separator="",     # No separator: cut anywhere, even mid-word
    chunk_size=150,   # Count to 150 characters
    chunk_overlap=20  # Repeat 20 characters in the next chunk
)
text = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back."
chunks = fixed_splitter.split_text(text)
Resulting Chunks:
Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like ba
Chunk 2: mour-like back.
Notice the disaster? It brutally cut the word "back" in half, leaving chunk 1 dangling on "ba" and chunk 2 starting mid-word with "mour-like back." The semantic meaning is completely broken, and an LLM can't make sense of these fragments.
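To make the failure mode concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python (a stand-in for the library splitter, not its actual implementation). It counts raw characters and nothing else:

```python
# Fixed-size splitting: advance by (chunk_size - chunk_overlap) each step,
# so the last `chunk_overlap` characters repeat in the next chunk.
def fixed_size_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = fixed_size_split("He lay on his armour-like back.",
                          chunk_size=20, chunk_overlap=5)
```

With a 20-character limit the result is ["He lay on his armour", "rmour-like back."]: the compound word "armour-like" is sliced wherever the counter happens to land, exactly the problem described above.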
Technique 2: A smarter default (Recursive splitting)
A much better approach is to try splitting intelligently. We give the splitter a list of "separators" to try in order of priority.
The default separator list, in order of priority, is:
- Paragraphs (\n\n)
- Lines (\n)
- Words (a single space)
- Individual characters, as a last resort
If a chunk is still too big after splitting on one separator, the splitter falls through to the next one in the list.
# Assumes the langchain-text-splitters package is installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# This splitter is "structure-aware"
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,   # A limit large enough to hold each paragraph whole
    chunk_overlap=20
)
text = "One morning... armour-like back.\n\nHis room... travelling salesman."
chunks = recursive_splitter.split_text(text)
Resulting Chunks:
Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back.
Chunk 2: His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table - Samsa was a travelling salesman.
This is much better! The splitter saw the \n\n (paragraph break) and respected it. It created two perfect, semantically complete chunks. This is the best "default" strategy.
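The fall-through logic above can be sketched in a few lines of plain Python (a simplified model, not the library's implementation; real splitters also merge small pieces back up to the size limit and add overlap, which this sketch skips):

```python
# Recursive splitting: try each separator in priority order, and only
# fall through to the next one when a piece is still over the limit.
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)          # fits: keep the piece intact
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

text = "First paragraph here.\n\nSecond paragraph here."
chunks = recursive_split(text, chunk_size=30)
```

Because both paragraphs fit under the limit, the paragraph break is respected and each comes back as its own intact chunk.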
Technique 3: The advanced approach (Semantic chunking)
What if we could split the document based on changes in topic? This is the idea behind semantic chunking. We create chunks by grouping sentences that are semantically similar.
Here's the logic:
graph TD
A[Split text into sentences] --> B(Embed each sentence)
B --> C{"Calculate similarity <br/> between Sentence 1 and 2"}
C --> D{"Is similarity low?"}
D -- Yes --> E[Start a NEW chunk]
D -- No --> F["Keep sentences in the <br/> SAME chunk"]
F --> G{"Calculate similarity <br/> between Sentence 2 and 3"}
G --> D
E --> G
Let's try this on a text with a clear topic shift:
Text:
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel...
At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."
A semantic chunker would analyze the sentences and find that the similarity between "Gustave Eiffel" and "At night... light show" is very low. It correctly identifies this as a topic break.
Resulting Chunks:
Chunk 1: (All about the tower's history and construction) "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel..."
Chunk 2: (All about the tower's lighting system) "At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."
This is an incredibly powerful technique for long documents, as it creates chunks that are perfectly coherent and focused on a single topic.
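The control flow in the diagram above can be sketched as follows. Real semantic chunkers embed each sentence with an embedding model and compare cosine similarities; here a toy word-overlap (Jaccard) score stands in for embedding similarity, purely to show the chunking loop:

```python
# Toy stand-in for embedding cosine similarity: fraction of shared words.
def toy_similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Walk sentence pairs; a similarity drop below the threshold marks a
# topic break, so we start a new chunk there.
def semantic_chunk(sentences, threshold=0.2):
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if toy_similarity(prev, curr) < threshold:
            chunks.append([curr])       # similarity dropped: new chunk
        else:
            chunks[-1].append(curr)     # same topic: same chunk
    return [" ".join(c) for c in chunks]

sentences = [
    "The Eiffel Tower stands in Paris.",
    "The Eiffel Tower was built in Paris in 1889.",
    "At night a light show makes it sparkle.",
]
chunks = semantic_chunk(sentences)
```

The two Eiffel Tower history sentences share enough vocabulary to stay together, while the lighting sentence shares almost none, so it starts a new chunk, mirroring the topic break described above.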
Technique 4: Content-aware splitting (e.g., markdown)
Finally, the best splitters understand the type of content they are reading. If you're chunking computer code, you should split by functions or classes, not by paragraphs.
If you're chunking Markdown (like this blog post), you should split by headings (#) to keep sections together.
Text:
# Understanding LLMs
Large Language Models (LLMs) are a type of AI.
## Key Features
- They are trained on vast amounts of text data.
- They can generate human-like text.
Result with a Recursive Splitter (Bad):
Chunk 1: # Understanding LLMs
Large Language Models (LLMs) are a type of AI.
## Key
Chunk 2: Features
- They are trained on vast...
Result with a Markdown-Aware Splitter (Good):
Chunk 1: # Understanding LLMs
Large Language Models (LLMs) are a type of AI.
Chunk 2: ## Key Features
- They are trained on vast amounts of text data.
- They can generate human-like text.
The Markdown-aware splitter knows that headings define new sections and intelligently keeps the heading and its content together in the same chunk.
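The heading-aware behavior can be sketched in plain Python (a simplified model; library Markdown splitters also track heading metadata and enforce size limits, which this sketch omits):

```python
# Markdown-aware splitting: start a new chunk at every heading line,
# keeping each heading together with the body text below it.
def markdown_split(text: str) -> list[str]:
    chunks = []
    for line in text.splitlines():
        if line.startswith("#") or not chunks:
            chunks.append(line)              # heading: open a new chunk
        else:
            chunks[-1] += "\n" + line        # body line: stays with its heading
    return chunks

doc = "# Understanding LLMs\nLLMs are a type of AI.\n## Key Features\n- Trained on text data."
chunks = markdown_split(doc)
```

This yields one chunk per section, with each heading attached to its own content, just like the "good" result above.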
Key takeaways
- Chunking is foundational: Your RAG system is only as good as its chunks. "Garbage In, Garbage Out."
- Fixed-size is risky: Simple character-based splitting will break words and sentences. Avoid it.
- Recursive is the best default: The RecursiveCharacterTextSplitter is a robust and smart choice for most plain text.
- Content-aware is best: For maximum quality, use a splitter that understands your content's structure (like Markdown, code, or semantic topics).
For more on building production AI systems, check out our AI Engineering Bootcamp.