RAG Workflow
A RAG system has two distinct phases: indexing (preparation) and querying (inference). Understanding this workflow is essential for building effective applications.
Phase 1: Indexing (Offline)
1. Load documents: Read PDFs, text files, web pages, databases.
2. Chunk: Split long documents into smaller pieces (e.g., 500‑1000 characters).
3. Generate embeddings: Convert each chunk into a vector (embedding) using an embedding model.
4. Store in vector database: Save chunk text + embedding + metadata for fast retrieval.
Indexing runs once up front and is then refreshed periodically as the source documents change.
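The four indexing steps above can be sketched in a few lines of Python. This is a minimal, illustrative version: the `embed()` function is a toy feature-hashing stand-in for a real embedding model, chunking is fixed-size by character count, and a plain list stands in for the vector database. All names here are hypothetical.

```python
import hashlib
import math
import re

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for an embedding model: hash each word into a bucket
    # of a fixed-size vector, then L2-normalise. A real system would
    # call a trained embedding model instead.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 500) -> list[str]:
    # Step 2: fixed-size character chunking. Production pipelines often
    # split on sentence/paragraph boundaries and add overlap.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: dict[str, str]) -> list[dict]:
    # Steps 1-4: for each loaded document, chunk it, embed each chunk,
    # and store text + embedding + metadata (a list standing in for a
    # vector database).
    index = []
    for doc_id, text in docs.items():
        for i, piece in enumerate(chunk(text)):
            index.append({
                "text": piece,
                "embedding": embed(piece),
                "metadata": {"doc": doc_id, "chunk": i},
            })
    return index

docs = {"intro.txt": "RAG combines retrieval with generation. " * 30}
index = build_index(docs)
print(len(index))  # → 3 (1200 characters split into 500-char chunks)
```

Each stored entry keeps the original chunk text alongside its embedding, so that retrieved chunks can later be pasted into the prompt verbatim.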
Phase 2: Querying (Inference)
1. User question: e.g., "What is RAG?"
2. Generate query embedding: Use the same embedding model as indexing.
3. Retrieve relevant chunks: Perform similarity search in the vector database (e.g., cosine similarity).
4. Augment prompt: Combine the original question with the retrieved chunks into a prompt (e.g., "Context: ... Question: ...").
5. Generate answer: Send the augmented prompt to an LLM.
6. Return answer: Optionally include source citations.
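Steps 2-4 of the query phase can be sketched as follows. The same toy hashing `embed()` is reused as a stand-in for the real embedding model (the crucial point from step 2: queries and documents must share one model), cosine similarity reduces to a dot product because the vectors are pre-normalised, and the chunk texts and prompt template are illustrative.

```python
import hashlib
import math
import re

def embed(text: str, dim: int = 64) -> list[float]:
    # Same toy hashing "embedding" as at indexing time; using the SAME
    # model for queries and documents is what makes retrieval work.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so cosine similarity = dot product.
    return sum(x * y for x, y in zip(a, b))

def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    # Step 3: rank stored chunks by similarity to the query embedding.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item["embedding"]),
                    reverse=True)
    return ranked[:k]

def augment(question: str, chunks: list[dict]) -> str:
    # Step 4: paste retrieved chunk text into a prompt template.
    context = "\n".join(c["text"] for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Tiny in-memory "vector DB" built with the same embed() function.
index = [{"text": t, "embedding": embed(t)} for t in [
    "RAG stands for retrieval-augmented generation.",
    "Embeddings map text to vectors.",
    "Chunking splits documents into pieces.",
]]

question = "What is RAG?"
prompt = augment(question, retrieve(index, question))
# Step 5 would send `prompt` to an LLM; that call is omitted here.
```

Note that the LLM call itself (step 5) is just one request with the augmented prompt; all the retrieval machinery sits in front of it.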
Indexing: Docs → Chunks → Embeddings → Vector DB
Querying: Question → Embedding → Retrieve → Augment → LLM → Answer
Why Separate Phases?
Indexing is compute‑intensive but runs rarely; querying must be fast (ideally sub‑second). Keeping the phases separate lets each one scale independently.
Two Minute Drill
- Indexing: load, chunk, embed, store.
- Querying: embed question, retrieve, augment, generate.
- Vector database stores embeddings for fast similarity search.
- Indexing is offline; querying is online and must be fast.
