In any RAG system, the chunking strategy determines the ceiling of your retrieval quality. You can have the best embedding model and the fastest vector database, but if your chunks don't capture the right information at the right granularity, the LLM will generate responses based on irrelevant or fragmented context.

When building the RAG-powered IAM policy generator at EPAM, chunking was the component we iterated on most. This post documents our systematic evaluation of three chunking strategies across two real-world corpora, and the surprising results that challenged our assumptions.

The Three Strategies

1. Fixed-Size Chunking

The simplest approach: split text into chunks of exactly N tokens (or characters) with an overlap of M tokens between consecutive chunks.

from langchain.text_splitter import CharacterTextSplitter

# Note: CharacterTextSplitter measures chunk_size in characters by default;
# use CharacterTextSplitter.from_tiktoken_encoder(...) for true token counts.
splitter = CharacterTextSplitter(
    chunk_size=512,       # maximum length per chunk
    chunk_overlap=50,     # overlap between consecutive chunks
    separator="\n"        # prefer splitting at newlines
)
chunks = splitter.split_text(document)

Pros: Predictable chunk sizes, simple to implement, and uniform input lengths for the embedding model (most models have an optimal input range). Cons: Splits mid-sentence, mid-paragraph, even mid-concept. A policy definition spanning 600 tokens gets arbitrarily cut into two chunks, neither of which captures the complete policy.

2. Recursive Character Splitting

LangChain's recursive splitter tries a hierarchy of separators — first double newlines (paragraph boundaries), then single newlines, then sentences, then characters — to find natural break points:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)

Pros: Respects document structure (paragraphs stay intact when possible). Much better than fixed-size for prose-heavy documents. Cons: Chunk sizes vary significantly (200-512 tokens). Short chunks embed poorly. Long chunks dilute relevance. Still structure-blind for technical documents with headers and code blocks.

3. Semantic Chunking

Uses the embedding model itself to determine chunk boundaries. Compute embeddings for each sentence, then measure cosine similarity between consecutive sentences. Split where similarity drops significantly — these are topic boundaries.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    # split wherever the distance between consecutive sentence embeddings
    # exceeds the 75th percentile, i.e. at the most dissimilar 25% of gaps
    breakpoint_threshold_amount=75
)
chunks = splitter.split_text(document)

Pros: Chunks are semantically coherent — each chunk discusses one topic. Dramatically better for documents covering multiple topics in sequence. Cons: Slow (requires embedding every sentence), chunk sizes are unpredictable (could be 100 or 2000 tokens), and sensitive to the embedding model used for boundary detection.
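The boundary-detection idea behind SemanticChunker can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it takes precomputed sentence embeddings (from any embedding function) and splits wherever the cosine distance between consecutive sentences exceeds a chosen percentile.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def percentile(values, pct):
    """Linearly interpolated percentile (numpy-style)."""
    ordered = sorted(values)
    rank = pct / 100 * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def semantic_split(sentences, embeddings, pct=75):
    """Group sentences into chunks, breaking where the cosine distance
    between consecutive sentence embeddings exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for i, d in enumerate(distances):
        if d > threshold:              # topic boundary: similarity dropped
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```

With toy 2-D embeddings where the first two sentences point one way and the last two another, the split lands exactly at the topic shift.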

Experimental Setup

Corpora

- Corpus A: AWS IAM documentation — 2,400 pages of well-structured reference material with clear section headers and code blocks
- Corpus B: Internal knowledge base — 800 pages of mixed, prose-heavy content

Evaluation

We created a test set of 200 questions per corpus with gold-standard retrieved chunks (annotated by domain experts). Metrics:

- Recall@5: fraction of questions where a gold chunk appears in the top-5 retrieved results
- Answer accuracy: whether the LLM's final answer matches the gold answer
- Chunking time and total chunk count, as proxies for indexing cost

Results

Corpus A: AWS IAM Documentation (2,400 pages)
Strategy    | Chunk Size | Recall@5 | Answer Acc | Chunk Time | Total Chunks
Fixed-512   |    512     |  62.5%   |   58.0%    |   12 sec   |   18,400
Recursive   |  200-512   |  71.0%   |   67.5%    |   15 sec   |   16,200
Semantic    |  100-2000  |  68.5%   |   65.0%    |   45 min   |   12,800

Corpus B: Internal Knowledge Base (800 pages)
Strategy    | Chunk Size | Recall@5 | Answer Acc | Chunk Time | Total Chunks
Fixed-512   |    512     |  58.0%   |   54.5%    |    4 sec   |    6,100
Recursive   |  200-512   |  69.5%   |   66.0%    |    5 sec   |    5,400
Semantic    |  100-2000  |  74.0%   |   71.5%    |   18 min   |    4,200

Analysis: The Surprising Findings

Semantic chunking doesn't always win

On Corpus A (structured IAM docs), recursive chunking outperformed semantic chunking. Why? AWS documentation is already well-structured with clear section headers. Recursive splitting naturally follows this structure. Semantic chunking sometimes merged unrelated but superficially similar sections (e.g., two different AWS services both discussing "policy attachments").

On Corpus B (mixed knowledge base), semantic chunking won decisively. The prose-heavy content had topic shifts within long paragraphs that recursive splitting missed.

"Match your chunking strategy to your document structure. Structured docs (headers, sections, code blocks) → recursive splitting is sufficient and 100× faster. Unstructured prose → semantic chunking is worth the cost."

The overlap sweet spot

Overlap Ablation (Recursive, Corpus A):
Overlap | Recall@5 | Storage Overhead
  0     |  65.0%   |   baseline
 25     |  69.0%   |   +5%
 50     |  71.0%   |   +10%
100     |  71.5%   |   +20%
200     |  70.0%   |   +40%  (recall declines)

Sweet spot: 50 tokens overlap (~10% of chunk size)
Higher overlap increases storage and index size with minimal gain.
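The storage column follows directly from the overlap arithmetic: after the first chunk, each chunk contributes only chunk_size − overlap new tokens, so the index stores roughly overlap / (chunk_size − overlap) extra material. A back-of-the-envelope helper (our naming, not part of the pipeline):

```python
import math

def chunking_cost(total_tokens, chunk_size, overlap):
    """Approximate chunk count and storage overhead for a sliding-window
    chunker with the given size and overlap (both in tokens)."""
    stride = chunk_size - overlap              # new tokens per chunk
    num_chunks = max(1, math.ceil((total_tokens - overlap) / stride))
    overhead = overlap / stride                # extra storage per input token
    return num_chunks, overhead
```

For 512-token chunks, 50 tokens of overlap predicts ~11% extra storage, close to the +10% measured above. The estimate drifts upward for larger overlaps because real splitters break at separators rather than at fixed strides.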

Chunk size matters more than strategy

Chunk Size Ablation (Recursive, Corpus A):
Size  | Recall@5 | Answer Acc
 128  |  53.0%   |   48.5%     (too short, no context)
 256  |  64.5%   |   60.0%
 512  |  71.0%   |   67.5%     (best balance)
 768  |  69.0%   |   66.0%     (diluted relevance)
1024  |  64.5%   |   62.0%     (too much noise in each chunk)

512 tokens is the sweet spot for these corpora.
This aligns with most embedding models' optimal input length.

The Reranker Effect

Adding a cross-encoder reranker (BGE-reranker-v2) after initial retrieval improved accuracy across all strategies:

Impact of Reranking (Corpus A, top-5 retrieval):
Strategy    | Recall@5 (no rerank) | Recall@5 (reranked) | Δ
Fixed-512   |       62.5%          |       72.0%         | +9.5%
Recursive   |       71.0%          |       79.5%         | +8.5%
Semantic    |       68.5%          |       78.0%         | +9.5%

Reranking narrowed the spread between strategies (8.5 points
worst-to-best without reranking, 7.5 with). More importantly,
reranked recursive (79.5%) actually edges past reranked
semantic (78.0%), making the simpler strategy the best overall setup.

Key insight: If you're going to use a reranker (and you should), the chunking strategy matters less. The reranker compensates for imperfect chunk boundaries by re-scoring based on the full query-chunk relationship.
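The retrieve-then-rerank flow is simple to wire up. This sketch keeps the cross-encoder abstract — a callable that scores (query, chunk) pairs. In our pipeline that callable is BGE-reranker-v2 (e.g. via sentence-transformers' CrossEncoder.predict), but any pairwise scorer fits; the function name here is ours:

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Re-score retrieved chunks with a cross-encoder and keep the best.

    candidates:  chunks from the initial (fast, bi-encoder) retrieval
    score_pairs: callable taking a list of (query, chunk) pairs and
                 returning one relevance score per pair
    """
    scores = score_pairs([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Flow: cast a wide net with the vector store, let the reranker pick.
# top20 = vector_store.search(query, k=20)              # fast, approximate
# best5 = rerank(query, top20, cross_encoder.predict)   # slow, precise
```

Casting a wide initial net (top-20) matters: the reranker can only promote chunks the bi-encoder surfaced in the first place.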

Our Production Configuration

Based on these experiments, our production RAG pipeline for the IAM policy generator uses:

Production RAG Chunking Pipeline:

1. Document Preprocessing:
   - Strip boilerplate (nav, footer, breadcrumbs)
   - Extract code blocks separately (embed with code-specific model)
   - Preserve headers as metadata (used for filtering)

2. Chunking:
   - Strategy: RecursiveCharacterTextSplitter
   - chunk_size: 512 tokens
   - chunk_overlap: 50 tokens
   - Additional: prepend section header to each chunk for context

3. Embedding:
   - Model: BGE-large-en-v1.5 (1024-dim)
   - Code chunks: CodeBERT embeddings separately

4. Retrieval:
   - Initial: top-20 from ChromaDB (cosine similarity)
   - Reranking: BGE-reranker-v2, return top-5
   - Metadata filtering: filter by cloud provider before search

5. Result:
   - Recall@5: 79.5%
   - End-to-end policy accuracy: 86.4%

Lessons Learned

  1. Start with recursive, not semantic. Recursive chunking is 100× faster to iterate on and gives 90% of semantic chunking's quality for structured documents.
  2. Invest in reranking. A cross-encoder reranker added 8-10% recall improvement regardless of chunking strategy. It's the highest-ROI component after the initial chunking.
  3. Chunk size > chunk strategy. Getting from 128 to 512 tokens improved recall more than switching from fixed to semantic chunking. Tune chunk size first.
  4. Prepend metadata. Adding the section header to each chunk's text (e.g., "AWS IAM > Policy Types > Resource-based policies: ...") gave a free 3-4% retrieval boost.
  5. Separate code chunks. Code blocks embedded with prose models retrieve poorly. Use a code-specific embedding model for code chunks and merge results at retrieval time.
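The dual-model setup from lesson 5 reduces to a small router: tag each chunk as code or prose during preprocessing, embed it with the matching model, and merge results at query time. The embed functions below are placeholders standing in for the two models we used (BGE-large for prose, CodeBERT for code):

```python
def embed_chunks(chunks, embed_prose, embed_code):
    """Route each chunk to the embedding model matching its content type.

    chunks:      list of dicts like {"text": ..., "type": "code" | "prose"}
    embed_prose: embedding function for natural-language chunks
    embed_code:  embedding function for code chunks
    Returns one vector per chunk, in input order.
    """
    return [
        (embed_code if chunk["type"] == "code" else embed_prose)(chunk["text"])
        for chunk in chunks
    ]
```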

"Chunking is the unsexy part of RAG that determines 70% of your system's quality. Most teams spend weeks on prompt engineering and 10 minutes on chunking. Invert that ratio."