Retrieval-Augmented Generation (RAG)
OFFLINE: Knowledge Indexing Phase

Documents (PDFs, Word files, web pages) → Chunking (split into 200-500-token chunks) → Embedding (convert each chunk to a vector) → Vector DB (store the embeddings)
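A minimal sketch of this indexing phase, assuming the sentence-transformers package; the model name, chunk sizes, `chunk` helper, and placeholder corpus are illustrative, and a plain in-memory array stands in for the vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into ~size-token pieces (whitespace words as a rough token proxy)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model

documents = ["...full text of doc 1...", "...full text of doc 2..."]  # placeholder corpus
chunks = [c for doc in documents for c in chunk(doc)]

# Embed every chunk once, offline; normalized vectors let a dot product
# serve as cosine similarity at query time.
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)
```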
ONLINE: Query Processing Phase

User Query (natural-language question) → Query Embedding (vectorize the question) → Similarity Search (find the top 3-5 chunks) → LLM Generation (answer using the retrieved context)
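The online phase reuses `model`, `chunks`, and `index` from the indexing sketch above; the `retrieve` helper and the sample question are illustrative:

```python
import numpy as np

def retrieve(query: str, k: int = 4) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                    # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]    # indices of the top-k chunks
    return [chunks[i] for i in top]

question = "What is our refund policy?"   # hypothetical user query
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to whichever LLM you use for generation.
```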
Scalability: millions of documents
Latency: 1-3 seconds
Updates: real-time
Citations: automatic
Cache-Augmented Generation (CAG)
PRE-LOAD: Knowledge Caching Phase (one-time)

Documents (all knowledge docs) → Concatenation (combine into a single prompt) → Context Window (32k-100k tokens) → KV Cache (knowledge encoded once)
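A minimal sketch of the one-time caching phase, assuming Hugging Face transformers; gpt2 is only a placeholder (a real CAG deployment needs a long-context model), and the corpus is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

# Concatenate all knowledge documents into one prompt.
knowledge = "\n\n".join(["...doc 1...", "...doc 2..."])  # placeholder docs
knowledge_ids = tok(knowledge, return_tensors="pt").input_ids

# One forward pass encodes the knowledge; the resulting key/value tensors
# are the cache that every later query reuses.
with torch.no_grad():
    out = lm(knowledge_ids, use_cache=True)
kv_cache = out.past_key_values
```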
QUERY: Fast Response Phase

User Query (natural-language question) → Append to Cache (add the query tokens to the existing KV cache) → Single Forward Pass (no re-encoding of the knowledge) → Fast Response (answer in 300-500 ms)
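A sketch of the response phase, reusing `tok`, `lm`, `knowledge_ids`, and `kv_cache` from the caching sketch; only the query and the generated tokens pass through the model, so the knowledge is never re-encoded (note that recent transformers versions mutate the cache in place, so serving many queries would require copying it first):

```python
import torch

query_ids = tok("\n\nQuestion: What is the refund policy?\nAnswer:",
                return_tensors="pt").input_ids      # hypothetical query

ids = torch.cat([knowledge_ids, query_ids], dim=1)  # full sequence so far
past = kv_cache
next_ids = query_ids                                # only new tokens are fed in

with torch.no_grad():
    for _ in range(50):                             # greedy decode, up to 50 tokens
        out = lm(next_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_ids = out.logits[:, -1:].argmax(dim=-1)  # most likely next token
        if next_ids.item() == tok.eos_token_id:
            break
        ids = torch.cat([ids, next_ids], dim=1)

# Decode only the query + answer; the cached knowledge prefix is skipped.
print(tok.decode(ids[0, knowledge_ids.shape[1]:]))
```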
Scalability: 32k-100k tokens
Latency: 300-500 ms
Updates: full cache rebuild
Infrastructure: simpler
Architecture Comparison
RAG
✅ Large knowledge bases: scales to millions of documents
✅ Real-time updates: incremental index updates
✅ Automatic citations: source tracking is built in
❌ Higher latency: retrieval adds 1-2 seconds
❌ Complex infrastructure: requires a vector database
Best for: enterprise AI, dynamic knowledge

CAG
✅ Ultra-low latency: 300-500 ms responses
✅ Simple deployment: no external dependencies
✅ Lower costs: no vector DB hosting
❌ Limited scale: bound by the 32k-100k-token context window
❌ Update overhead: any change requires a full cache rebuild
Best for: static docs, low latency
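One practical way to pick between the two is to estimate whether the whole corpus fits in the model's context window; a rough sketch using the common ~4-characters-per-token heuristic (the function name and threshold are illustrative):

```python
def fits_in_context(docs: list[str], window_tokens: int = 100_000) -> bool:
    """Rough fit check: ~4 characters per token on average English text."""
    est_tokens = sum(len(d) for d in docs) // 4
    return est_tokens <= window_tokens

corpus = ["...doc 1...", "...doc 2..."]  # placeholder corpus
print(fits_in_context(corpus))           # True: CAG is viable; False: prefer RAG
```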