Splitting Techniques for RAG: The Art of the Right Chunk
In our last post, we built a RAG pipeline. One step we glossed over, how we split our documents into chunks, turns out to be the most critical part of the entire RAG system.
The fundamental problem: Bad Chunks = Bad Retrieval = Bad Answer
Think about it: the retriever's only job is to find the most relevant chunks. If our chunks are bad, the retriever will fail, and the LLM will give a bad answer.
Imagine a textbook where every paragraph is cut in half and randomly stitched to another half-paragraph. Finding a coherent answer would be impossible, no matter how smart the student is.
The quality of your chunks determines the quality of your retrieval, which determines the quality of your final answer. A bad splitting strategy will poison your entire pipeline.
Technique 1: The naive approach (Fixed-size splitting)
The simplest way to chunk a document is to just count a fixed number of characters (e.g., 150) and then split. We can also add an "overlap" to repeat a few characters, hoping to keep some context.
Let's see why this is often a bad idea.
# Assumes the langchain-text-splitters package is installed.
from langchain_text_splitters import CharacterTextSplitter

# This splitter just counts characters.
# It doesn't understand words or sentences.
fixed_splitter = CharacterTextSplitter(
    separator="",     # No separator: cut anywhere, even mid-word
    chunk_size=150,   # Count to 150 characters
    chunk_overlap=20  # Repeat 20 characters in the next chunk
)
text = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back."
chunks = fixed_splitter.split_text(text)
Resulting Chunks:
Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like ba
Chunk 2: mour-like back.
Notice the disaster? It brutally cut the word "back" in half, leaving chunk 1 dangling on "ba" and chunk 2 starting mid-word with "mour-like back." The semantic meaning is completely broken, and an LLM can't make sense of these fragments.
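To make the failure mode concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python (a stand-in for the library splitter, not its actual implementation). It counts raw characters and nothing else:

```python
# Fixed-size splitting: advance by (chunk_size - chunk_overlap) each step,
# so the last `chunk_overlap` characters repeat in the next chunk.
def fixed_size_split(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = fixed_size_split("He lay on his armour-like back.",
                          chunk_size=20, chunk_overlap=5)
```

With a 20-character limit the result is ["He lay on his armour", "rmour-like back."]: the compound word "armour-like" is sliced wherever the counter happens to land, exactly the problem described above.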
Technique 2: A smarter default (Recursive splitting)
A much better approach is to try splitting intelligently. We give the splitter a list of "separators" to try in order of priority.
The default separator list, in order of priority, is:
- Paragraphs (\n\n)
- Lines (\n)
- Words (a single space)
- Individual characters, as a last resort
If a chunk is still too big after splitting on one separator, the splitter falls through to the next one in the list.
# Assumes the langchain-text-splitters package is installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# This splitter is "structure-aware"
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,   # A limit large enough to hold each paragraph whole
    chunk_overlap=20
)
text = "One morning... armour-like back.\n\nHis room... travelling salesman."
chunks = recursive_splitter.split_text(text)
Resulting Chunks:
Chunk 1: One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back.
Chunk 2: His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table - Samsa was a travelling salesman.
This is much better! The splitter saw the \n\n (paragraph break) and respected it. It created two perfect, semantically complete chunks. This is the best "default" strategy.
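The fall-through logic above can be sketched in a few lines of plain Python (a simplified model, not the library's implementation; real splitters also merge small pieces back up to the size limit and add overlap, which this sketch skips):

```python
# Recursive splitting: try each separator in priority order, and only
# fall through to the next one when a piece is still over the limit.
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)          # fits: keep the piece intact
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

text = "First paragraph here.\n\nSecond paragraph here."
chunks = recursive_split(text, chunk_size=30)
```

Because both paragraphs fit under the limit, the paragraph break is respected and each comes back as its own intact chunk.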
Technique 3: The advanced approach (Semantic chunking)
What if we could split the document based on changes in topic? This is the idea behind semantic chunking. We create chunks by grouping sentences that are semantically similar.
Here's the logic:
graph TD
A[Split text into sentences] --> B(Embed each sentence)
B --> C{"Calculate similarity <br/> between Sentence 1 and 2"}
C --> D{"Is similarity low?"}
D -- Yes --> E[Start a NEW chunk]
D -- No --> F["Keep sentences in the <br/> SAME chunk"]
F --> G{"Calculate similarity <br/> between Sentence 2 and 3"}
G --> D
E --> G
Let's try this on a text with a clear topic shift:
Text:
"The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel...
At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."
A semantic chunker would analyze the sentences and find that the similarity between "Gustave Eiffel" and "At night... light show" is very low. It correctly identifies this as a topic break.
Resulting Chunks:
Chunk 1: (All about the tower's history and construction) "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris... It is named after the engineer Gustave Eiffel..."
Chunk 2: (All about the tower's lighting system) "At night, the Eiffel Tower is illuminated by a dazzling light show. Every evening, the tower sparkles for five minutes every hour..."
This is an incredibly powerful technique for long documents, as it creates chunks that are perfectly coherent and focused on a single topic.
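The control flow in the diagram above can be sketched as follows. Real semantic chunkers embed each sentence with an embedding model and compare cosine similarities; here a toy word-overlap (Jaccard) score stands in for embedding similarity, purely to show the chunking loop:

```python
# Toy stand-in for embedding cosine similarity: fraction of shared words.
def toy_similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Walk sentence pairs; a similarity drop below the threshold marks a
# topic break, so we start a new chunk there.
def semantic_chunk(sentences, threshold=0.2):
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if toy_similarity(prev, curr) < threshold:
            chunks.append([curr])       # similarity dropped: new chunk
        else:
            chunks[-1].append(curr)     # same topic: same chunk
    return [" ".join(c) for c in chunks]

sentences = [
    "The Eiffel Tower stands in Paris.",
    "The Eiffel Tower was built in Paris in 1889.",
    "At night a light show makes it sparkle.",
]
chunks = semantic_chunk(sentences)
```

The two Eiffel Tower history sentences share enough vocabulary to stay together, while the lighting sentence shares almost none, so it starts a new chunk, mirroring the topic break described above.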
Technique 4: Content-aware splitting (e.g., markdown)
Finally, the best splitters understand the type of content they are reading. If you're chunking computer code, you should split by functions or classes, not by paragraphs.
If you're chunking Markdown (like this blog post), you should split by headings (#) to keep sections together.
Text:
# Understanding LLMs
Large Language Models (LLMs) are a type of AI.
## Key Features
- They are trained on vast amounts of text data.
- They can generate human-like text.
Result with a Recursive Splitter (Bad):
Chunk 1: # Understanding LLMs
Large Language Models (LLMs) are a type of AI.
## Key
Chunk 2: Features
- They are trained on vast...
Result with a Markdown-Aware Splitter (Good):
Chunk 1: # Understanding LLMs
Large Language Models (LLMs) are a type of AI.
Chunk 2: ## Key Features
- They are trained on vast amounts of text data.
- They can generate human-like text.
The Markdown-aware splitter knows that headings define new sections and intelligently keeps the heading and its content together in the same chunk.
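The heading-aware behavior can be sketched in plain Python (a simplified model; library Markdown splitters also track heading metadata and enforce size limits, which this sketch omits):

```python
# Markdown-aware splitting: start a new chunk at every heading line,
# keeping each heading together with the body text below it.
def markdown_split(text: str) -> list[str]:
    chunks = []
    for line in text.splitlines():
        if line.startswith("#") or not chunks:
            chunks.append(line)              # heading: open a new chunk
        else:
            chunks[-1] += "\n" + line        # body line: stays with its heading
    return chunks

doc = "# Understanding LLMs\nLLMs are a type of AI.\n## Key Features\n- Trained on text data."
chunks = markdown_split(doc)
```

This yields one chunk per section, with each heading attached to its own content, just like the "good" result above.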
Key takeaways
- Chunking is foundational: Your RAG system is only as good as its chunks. "Garbage In, Garbage Out."
- Fixed-size is risky: Simple character-based splitting will break words and sentences. Avoid it.
- Recursive is the best default: The RecursiveCharacterTextSplitter is a robust and smart choice for most plain text.
- Content-aware is best: For maximum quality, use a splitter that understands your content's structure (like Markdown, code, or semantic topics).
For more on building production AI systems, check out our AI Engineering Bootcamp.