Advanced Chunking
Basic fixed‑size chunking ignores document structure. Advanced chunking preserves semantic boundaries using markdown headers, code blocks, or semantic similarity.
Markdown Header Splitting
Preserves heading hierarchy. Chunks are split at headers, with metadata tracking the heading path.
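To make the mechanism concrete, here is a minimal, dependency-free sketch of header-based splitting. It is an illustration of the idea, not LangChain's implementation; the function name split_by_headers and its max_level parameter are hypothetical, and headers inside fenced code blocks are not handled.

```python
import re

def split_by_headers(markdown_text, max_level=2):
    """Split markdown at headers up to max_level, tracking the heading path."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks = []
    path = {}    # header level -> title of the heading currently in effect
    buffer = []  # body lines collected since the last split

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(path.items())}
            chunks.append({"text": text, "metadata": metadata})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the previous chunk under the old heading path
            level = len(match.group(1))
            # A new heading replaces headings at the same or a deeper level.
            for lvl in [l for l in path if l >= level]:
                del path[lvl]
            path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks

doc = "# Guide\nIntro text.\n## Install\nRun pip.\n# FAQ\nNone yet."
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "->", chunk["text"])
```

Each chunk carries the titles of the headings above it, so a retriever can show or filter on where the text came from. LangChain provides a ready-made splitter for this: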
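# Current releases ship this class in langchain-text-splitters; older ones expose it as langchain.text_splitter.
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Each pair maps a header marker to the metadata key that stores its title.
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# markdown_text is the raw markdown string, defined elsewhere.
chunks = splitter.split_text(markdown_text)
HTML Header Splitting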
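The same approach works for HTML documents such as web pages: HTMLHeaderTextSplitter splits at heading tags (for example h1 and h2) and records the heading path in each chunk's metadata.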
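As a rough illustration of how heading tags can drive the split, the sketch below uses only the standard library's html.parser. The class HeaderSplitter and the helper split_html_by_headers are hypothetical names, and this is a simplified stand-in rather than the library's implementation (it ignores scripts, styles, and whitespace around inline tags).

```python
from html.parser import HTMLParser

class HeaderSplitter(HTMLParser):
    """Collect the text under each heading tag, tracking the heading path."""

    def __init__(self, header_tags=("h1", "h2")):
        super().__init__()
        self.header_tags = header_tags
        self.path = {}           # tag -> text of the heading currently in effect
        self.chunks = []
        self._buffer = []        # body text collected since the last heading
        self._heading = None     # tag name while the parser is inside a heading
        self._heading_text = []

    def flush(self):
        text = " ".join(" ".join(self._buffer).split())  # normalize whitespace
        if text:
            self.chunks.append({"text": text, "metadata": dict(self.path)})
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self.flush()  # close the previous chunk under the old heading path
            self._heading = tag
            self._heading_text = []

    def handle_endtag(self, tag):
        if tag == self._heading:
            rank = self.header_tags.index(tag)
            # A new heading replaces headings at the same or a deeper level.
            for deeper in self.header_tags[rank:]:
                self.path.pop(deeper, None)
            self.path[tag] = "".join(self._heading_text).strip()
            self._heading = None

    def handle_data(self, data):
        if self._heading:
            self._heading_text.append(data)
        else:
            self._buffer.append(data)

def split_html_by_headers(html):
    parser = HeaderSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the text after the final heading
    return parser.chunks

page = "<h1>Guide</h1><p>Intro.</p><h2>Install</h2><p>Run pip.</p>"
for chunk in split_html_by_headers(page):
    print(chunk["metadata"], "->", chunk["text"])
```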
Semantic Chunking
Uses embeddings to split where there is a significant semantic shift. More computationally expensive but yields more coherent chunks.
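The core idea can be sketched in a few lines: embed each sentence, measure the distance between neighbors, and split where that distance is unusually large. The example below is a simplified illustration under stated assumptions. The embed function is a toy bag-of-words stand-in for a real embedding model, and semantic_chunks and percentile_value are hypothetical helpers, not library APIs.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile_value(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ranked = sorted(values)
    pos = (len(ranked) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ranked[lo] + (ranked[hi] - ranked[lo]) * (pos - lo)

def semantic_chunks(sentences, percentile=95):
    """Group sentences, splitting where the neighbor distance exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile_value(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # significant semantic shift: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Cats purr when happy.",
    "Cats sleep most of the day.",
    "Stock markets fell sharply today.",
    "Markets recovered after the stock rally.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The extra cost mentioned above comes from the embedding step: every sentence must be embedded before any split point can be chosen. With LangChain, the equivalent step looks like this: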
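from langchain_experimental.text_splitter import SemanticChunker
# embeddings is any LangChain embeddings model; documents is a list of Document objects (both defined elsewhere).
# "percentile" splits where the distance between adjacent sentences exceeds a percentile of all such distances.
splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="percentile")
chunks = splitter.split_documents(documents)
Choosing a Strategy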
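- Markdown/HTML header splitting: structured documents with headings (documentation, blog posts).
- Semantic chunking: unstructured text where the topic shifts without visible markers.
- Recursive character splitting: a simple fallback when neither of the above fits.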
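The choice above can be expressed as a small dispatcher. This is only a sketch: choose_strategy is a hypothetical helper, and the word-count cutoff used to pick between semantic and recursive splitting is an assumption made for illustration, not a rule from any library.

```python
import os
import re

def choose_strategy(filename, text, min_words_for_semantic=200):
    """Return the name of a chunking strategy for the given document."""
    ext = os.path.splitext(filename)[1].lower()
    # Structured documents: split on their own headings.
    if ext in {".md", ".markdown"} or re.search(r"^#{1,6}\s+\S", text, re.MULTILINE):
        return "markdown_header"
    if ext in {".html", ".htm"} or re.search(r"<h[1-6][\s>]", text, re.IGNORECASE):
        return "html_header"
    # Assumption: longer unstructured prose justifies the cost of embeddings.
    if len(text.split()) >= min_words_for_semantic:
        return "semantic"
    return "recursive_character"  # simple fallback

print(choose_strategy("guide.md", "# Setup\nInstall the package."))
print(choose_strategy("notes.txt", "A short note."))
```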
Two Minute Drill
- Markdown/HTML splitting preserves document structure.
- Semantic chunking splits at meaning boundaries.
- Choose method based on document type.
- Better chunking = better retrieval.
