RAG Workflow
A RAG system has two distinct phases: indexing (preparation) and querying (inference). Understanding this workflow is essential for building effective applications.
Phase 1: Indexing (Offline)
1. Load documents: Read PDFs, text files, web pages, databases.
2. Chunk: Split long documents into smaller pieces (e.g., 500‑1000 characters).
3. Generate embeddings: Convert each chunk into a vector (embedding) using an embedding model.
4. Store in vector database: Save chunk text + embedding + metadata for fast retrieval.
Indexing runs once up front and is then refreshed periodically as the source documents change.
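The four indexing steps above can be sketched in a few lines of Python. This is a minimal, illustrative version: the `embed()` function is a toy feature-hashing stand-in for a real embedding model, chunking is fixed-size by character count, and a plain list stands in for the vector database. All names here are hypothetical.

```python
import hashlib
import math
import re

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for an embedding model: hash each word into a bucket
    # of a fixed-size vector, then L2-normalise. A real system would
    # call a trained embedding model instead.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 500) -> list[str]:
    # Step 2: fixed-size character chunking. Production pipelines often
    # split on sentence/paragraph boundaries and add overlap.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: dict[str, str]) -> list[dict]:
    # Steps 1-4: for each loaded document, chunk it, embed each chunk,
    # and store text + embedding + metadata (a list standing in for a
    # vector database).
    index = []
    for doc_id, text in docs.items():
        for i, piece in enumerate(chunk(text)):
            index.append({
                "text": piece,
                "embedding": embed(piece),
                "metadata": {"doc": doc_id, "chunk": i},
            })
    return index

docs = {"intro.txt": "RAG combines retrieval with generation. " * 30}
index = build_index(docs)
print(len(index))  # → 3 (1200 characters split into 500-char chunks)
```

Each stored entry keeps the original chunk text alongside its embedding, so that retrieved chunks can later be pasted into the prompt verbatim.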
Phase 2: Querying (Inference)
1. User question: e.g., "What is RAG?"
2. Generate query embedding: Use the same embedding model as indexing.
3. Retrieve relevant chunks: Perform similarity search in the vector database (e.g., cosine similarity).
4. Augment prompt: Combine the original question with the retrieved chunks into a prompt (e.g., "Context: ... Question: ...").
5. Generate answer: Send the augmented prompt to an LLM.
6. Return answer: Optionally include source citations.
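Steps 2-4 of the query phase can be sketched as follows. The same toy hashing `embed()` is reused as a stand-in for the real embedding model (the crucial point from step 2: queries and documents must share one model), cosine similarity reduces to a dot product because the vectors are pre-normalised, and the chunk texts and prompt template are illustrative.

```python
import hashlib
import math
import re

def embed(text: str, dim: int = 64) -> list[float]:
    # Same toy hashing "embedding" as at indexing time; using the SAME
    # model for queries and documents is what makes retrieval work.
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so cosine similarity = dot product.
    return sum(x * y for x, y in zip(a, b))

def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    # Step 3: rank stored chunks by similarity to the query embedding.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item["embedding"]),
                    reverse=True)
    return ranked[:k]

def augment(question: str, chunks: list[dict]) -> str:
    # Step 4: paste retrieved chunk text into a prompt template.
    context = "\n".join(c["text"] for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Tiny in-memory "vector DB" built with the same embed() function.
index = [{"text": t, "embedding": embed(t)} for t in [
    "RAG stands for retrieval-augmented generation.",
    "Embeddings map text to vectors.",
    "Chunking splits documents into pieces.",
]]

question = "What is RAG?"
prompt = augment(question, retrieve(index, question))
# Step 5 would send `prompt` to an LLM; that call is omitted here.
```

Note that the LLM call itself (step 5) is just one request with the augmented prompt; all the retrieval machinery sits in front of it.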
Indexing: Docs → Chunks → Embeddings → Vector DB
Querying: Question → Embedding → Retrieve → Augment → LLM → Answer
Why Separate Phases?
Indexing is compute‑intensive but runs rarely; querying must be fast (ideally sub‑second). Keeping the phases separate lets each one scale independently.
Two Minute Drill
- Indexing: load, chunk, embed, store.
- Querying: embed question, retrieve, augment, generate.
- Vector database stores embeddings for fast similarity search.
- Indexing is offline; querying is online and must be fast.
