
Chunking Strategies

After loading documents, you must split them into smaller chunks. Chunking ensures that retrieved text fits within the LLM's context window and that each chunk is semantically coherent.

Fixed‑Size Chunking

The simplest method: split by a fixed number of characters. Use `RecursiveCharacterTextSplitter`, which tries to split at natural boundaries (paragraphs, sentences, words).
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
```
`chunk_size` sets the maximum number of characters per chunk. `chunk_overlap` sets how many characters consecutive chunks share, which helps preserve context across chunk boundaries.

Semantic Chunking

Semantic chunking uses embeddings to split where the topic shifts: adjacent sentences are embedded, and a new chunk begins where the similarity between them drops. It is more advanced and computationally expensive but yields more semantically coherent chunks. LangChain implements it in `SemanticChunker` (in the `langchain_experimental` package).
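The idea can be sketched without the library. The sketch below uses a toy bag-of-words "embedding" and a hypothetical 0.2 similarity threshold (a real pipeline would use a proper embedding model); a new chunk starts whenever the cosine similarity between consecutive sentences falls below the threshold:

```python
from collections import Counter
import math

def embed(sentence):
    # Toy bag-of-words "embedding"; a real pipeline would call an embedding model.
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))  # semantic shift: start a new chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Cats are small mammals.",
    "Cats are popular pets.",
    "Python is a programming language.",
    "Python supports multiple paradigms.",
]
print(semantic_chunks(sentences))
# Splits into two chunks: one about cats, one about Python.
```

`SemanticChunker` follows the same principle but computes real embedding distances and picks breakpoints statistically (e.g., by percentile) rather than with a fixed threshold.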

Markdown / HTML Splitting

Preserves structure (headings, lists). Use `MarkdownHeaderTextSplitter` or `HTMLHeaderTextSplitter`.
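As a rough sketch of what header-based splitting does, the function below (a simplified stand-in for `MarkdownHeaderTextSplitter`, without its nested-header metadata) splits a markdown string into (header, body) sections:

```python
import re

def split_on_headers(markdown_text):
    # Split markdown into (header, body) sections at heading lines.
    sections, header, body = [], None, []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6} ", line):
            if header is not None or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections

doc = "# Intro\nSome text.\n## Details\nMore text."
print(split_on_headers(doc))
# [('Intro', 'Some text.'), ('Details', 'More text.')]
```

Because each chunk carries its heading, retrieval results stay anchored to the document's structure instead of cutting across sections.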

Choosing Chunk Size

  • Small chunks (100‑200 characters): precise but may lose context.
  • Large chunks (1000‑2000 characters): more context but may include irrelevant information.
  • Typical starting point: 500 characters with a 50‑character overlap.
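To see the tradeoff concretely, here is a minimal fixed-size splitter with overlap (a simplified stand-in for `RecursiveCharacterTextSplitter`, which additionally prefers natural boundaries):

```python
def fixed_chunks(text, chunk_size, overlap):
    # Slide a window of chunk_size characters, stepping forward by
    # chunk_size - overlap so consecutive chunks share some context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 2000
print(len(fixed_chunks(text, 200, 20)))    # many small, precise chunks
print(len(fixed_chunks(text, 1000, 100)))  # fewer, more contextual chunks
```

Smaller chunks mean more vectors to store and search, and each retrieved chunk carries less surrounding context; larger chunks dilute the query-relevant text with neighboring material.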


Two Minute Drill
  • Chunking splits long documents into smaller pieces.
  • `RecursiveCharacterTextSplitter` is the standard choice.
  • Overlap helps maintain context across chunks.
  • Semantic chunking is more advanced but can improve retrieval.

Need more clarification?

Drop us an email at career@quipoinfotech.com