Retrieval-Augmented Generation (RAG)

OFFLINE Knowledge Indexing Phase
Documents: PDFs, Word files, web pages
Chunking: split into 200-500 token chunks
Embedding: convert each chunk to a vector
Vector DB: store the embeddings for similarity search
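A minimal sketch of the offline indexing phase, assuming sentence-transformers and FAISS as the embedding model and vector store; the fixed word-window chunker is only a stand-in for a proper 200-500 token splitter.

```python
# Offline indexing sketch: chunk -> embed -> store.
# Assumes sentence-transformers and faiss-cpu are installed.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 300) -> list[str]:
    # Fixed word windows keep the sketch simple; a real pipeline would
    # split on tokens and respect sentence/section boundaries.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents: list[str]):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = model.encode(chunks, normalize_embeddings=True)  # unit vectors -> cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])                # exact inner-product search
    index.add(np.asarray(vectors, dtype="float32"))
    return index, chunks, model
```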
ONLINE Query Processing Phase
User Query: natural-language question
Query Embedding: vectorize the question with the same embedding model
Similarity Search: retrieve the top 3-5 most similar chunks
LLM Generation: answer using the retrieved chunks as context
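Continuing the sketch, the online path embeds the question, pulls the top-k chunks from the index built above, and passes them to the model as context; ask_llm is a placeholder for whatever chat-completion client is in use.

```python
# Online query sketch, reusing index/chunks/model from build_index() above.
import numpy as np

def answer(query: str, index, chunks, model, ask_llm, k: int = 4) -> str:
    q_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)  # top-k nearest chunks
    context = "\n\n".join(f"[{i}] {chunks[i]}" for i in ids[0])   # numbered so the answer can cite sources
    prompt = (
        "Answer the question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return ask_llm(prompt)  # placeholder: any chat-completion call works here
```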
Scalability: millions of documents
Latency: 1-3 seconds per query
Updates: real-time (incremental index updates)
Citations: automatic source tracking

Cache-Augmented Generation (CAG)

PRE-LOAD Knowledge Caching Phase (One-time)
Documents: all knowledge docs
Concatenation: combine everything into a single prompt
Context Window: must fit within 32k-100k tokens
KV Cache: the encoded knowledge, computed once
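A minimal sketch of the one-time caching pass, assuming a Hugging Face causal LM; gpt2 is only a small placeholder, since a real CAG setup needs a long-context model to hold 32k-100k tokens of knowledge.

```python
# CAG pre-load sketch: encode the concatenated knowledge once, keep the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder; use a long-context model in practice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def build_kv_cache(documents: list[str]):
    # One-time forward pass over all knowledge docs; the returned
    # past_key_values hold the knowledge in encoded form.
    knowledge = "\n\n".join(documents)                        # simple concatenation into one prompt
    ids = tokenizer(knowledge, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    return out.past_key_values, ids
```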
QUERY Fast Response Phase
User Query: natural-language question
Append to Cache: add the query tokens to the existing KV cache
Single Forward Pass: generate without re-encoding the knowledge
Fast Response: answer in 300-500 ms
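Continuing that sketch, answering a query only runs the query and the newly generated tokens through the model; the cached knowledge is never re-encoded. Greedy decoding keeps the example short.

```python
# CAG query sketch, reusing tokenizer/model and build_kv_cache() from the pre-load step.
# Note: recent transformers versions extend the cache object in place, so copy it
# first if the same knowledge cache must serve many independent queries.
def answer_from_cache(query: str, cache, max_new_tokens: int = 64) -> str:
    input_ids = tokenizer("\n\nQuestion: " + query + "\nAnswer:", return_tensors="pt").input_ids
    past = cache
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            input_ids = next_id                                          # feed only the new token next step
    return tokenizer.decode(generated)
```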
Scalability: limited to the 32k-100k token context window
Latency: 300-500 ms
Updates: full cache rebuild required
Infrastructure: simpler (no vector database)

Architecture Comparison

RAG

✅ Large knowledge bases: scales to millions of documents
✅ Real-time updates: incremental index updates
✅ Automatic citations: source tracking built in
❌ Higher latency: retrieval adds 1-2 seconds per query
❌ Complex infrastructure: requires a vector database

CAG

✅ Ultra-low latency: 300-500 ms responses
✅ Simple deployment: no external dependencies
✅ Lower costs: no vector DB hosting
❌ Limited scale: 32k-100k token context limit
❌ Update overhead: full cache rebuild needed
RAG is best for: enterprise AI with large, dynamic knowledge bases
CAG is best for: static documentation and low-latency use cases