RAG — Retrieval-Augmented Generation
Give an LLM an "open-book exam". Learn how RAG grounds models in external data, the retrieve-augment-generate pipeline, vector databases (Chroma, FAISS) and embedding models.
16.1 Why RAG? Beyond Model Knowledge
LLMs are powerful but have real limitations:
- Frozen in time — they only know information up to their training cutoff; they cannot answer about recent events.
- Hallucinations — they confidently produce wrong answers.
- No private knowledge — they do not know your company rules or personal documents.
- No sources — they cannot show where information came from, so it is hard to verify.
What RAG fixes
- Gives more accurate, up-to-date answers.
- Reduces hallucinations by grounding answers in real, trusted data.
- Handles specific/private topics using your own documents.
- Enables citations — you can show the source of the answer.
16.2 How RAG Works — the Pipeline
- User Query — the user asks a question.
- Embedding creation — the query is converted into a vector.
- Retrieve — the system searches a vector database for the most relevant document chunks (by vector similarity).
- Rank & Select — the best-matching chunks are scored and the top ones chosen.
- Augment — the retrieved chunks are added to the prompt as "context".
- Generate — the LLM produces the answer using the retrieved context + the query.
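The steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words counter standing in for a real embedding model (so it matches shared words, not meaning), `generate` is a placeholder for the LLM call, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Embed the query, score every chunk, and keep the top-k (score, chunk) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # rank
    return scored[:k]                                     # select

def augment(query, ranked):
    """Insert the selected chunks into the prompt as context."""
    context = "\n".join(chunk for _, chunk in ranked)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Placeholder for the LLM call; a real system would send the prompt to a model."""
    return f"[LLM would answer from a {len(prompt)}-character prompt]"

chunks = [
    "Employees get 18 days of earned leave per year.",
    "Remote work is allowed on Fridays with manager approval.",
    "Health insurance covers the employee and 2 dependents.",
]
query = "How many days of leave do employees get?"
ranked = retrieve(query, chunks, k=1)
prompt = augment(query, ranked)
print(prompt)
print(generate(prompt))
```

Swapping `embed` for a real embedding model and `generate` for a real LLM call leaves the overall flow unchanged.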
16.3 RAG Components
| Component | Role |
|---|---|
| External Data | New data not in the LLM's training — from APIs, databases, document repositories |
| Retriever | Converts the query to a vector and finds the most relevant chunks in the vector DB |
| Ranker | Scores the retrieved chunks so the most relevant appear first, then adds them to the prompt |
| Generator (LLM) | Combines retrieved context + the query to produce the final answer |
Chunking
Large documents are split into small chunks before being embedded, so retrieval is precise — you fetch just the relevant paragraph, not a whole 50-page PDF. Updates can be real-time (added immediately) or batch (periodic).
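A minimal sketch of chunking, assuming a simple word-based splitter. The function name `chunk_words` and its parameters are hypothetical, and the overlap between neighbouring chunks is an addition not described above: it is a common way to avoid cutting a sentence's context at a chunk boundary.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words; each chunk shares
    `overlap` words with the one before it."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demonstration with 12 placeholder "words".
document = " ".join(f"w{i}" for i in range(12))
for piece in chunk_words(document, chunk_size=5, overlap=2):
    print(piece)
```

Real pipelines often split on sentences, paragraphs, or tokens instead of whitespace-separated words; the right chunk size depends on the documents and the embedding model.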
16.4 Embeddings & Vector Databases
Embeddings — the key to retrieval
ML models operate on numbers rather than raw text, so text is converted into embeddings: vectors of numbers that represent meaning in a multi-dimensional space. Texts with similar meaning have vectors that are close together.
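"Close together" is often made precise with cosine similarity, the measure used in this lecture's code examples. For two embedding vectors a and b with n dimensions, a standard definition is:

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The value depends only on the angle between the vectors, not on their lengths, which is why it works as a measure of how similar two texts are regardless of how long they are.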
Vector Databases
| Vector DB | Notes |
|---|---|
| Chroma | Lightweight, open-source, runs locally with no server — great for development & learning |
| FAISS | Facebook AI Similarity Search — a high-performance library for fast similarity search at scale |
| Pinecone / Weaviate | Managed/cloud vector databases for production |
Embedding models
An embedding model is the neural network that converts text into vectors. Common choices: all-MiniLM-L6-v2 (a small, free, fast open model producing 384-dimensional vectors) or commercial APIs like OpenAI's text-embedding-3-small. The same embedding model must be used for both the documents and the query.
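One way to enforce the same-model rule is to record which model built the store and reject vectors from any other. The sketch below is purely illustrative: `TaggedStore` and its methods are hypothetical and do not belong to any library. The dimensions used are 384 for all-MiniLM-L6-v2 and 1536, the default output size of text-embedding-3-small.

```python
class TaggedStore:
    """Hypothetical in-memory store that records which model produced its vectors."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = {}

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"store was built with {self.model_name!r}, not {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, doc_id, vector, model_name):
        self._check(vector, model_name)
        self.vectors[doc_id] = vector

    def check_query(self, vector, model_name):
        """Raise ValueError if a query vector is not comparable with the stored ones."""
        self._check(vector, model_name)


store = TaggedStore("all-MiniLM-L6-v2", dim=384)
store.add("doc1", [0.0] * 384, model_name="all-MiniLM-L6-v2")
try:
    # A query embedded with a different model is rejected.
    store.check_query([0.0] * 1536, model_name="text-embedding-3-small")
except ValueError as err:
    print("Rejected:", err)
```

Checking the model name matters even when dimensions happen to match: two different models place texts in unrelated vector spaces, so distances between their vectors carry no meaning.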
16.5 Building a RAG Pipeline
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How do I return a product?",
             "What is the refund policy?",      # similar meaning
             "What time does the store open?"]  # different topic
emb = model.encode(sentences)  # one vector per sentence
sim = cosine_similarity(emb)   # 3x3 matrix of pairwise similarities
print("A vs B:", round(sim[0][1], 3))  # similar
print("A vs C:", round(sim[0][2], 3))  # different
import chromadb
client = chromadb.Client()
collection = client.create_collection("company_docs")
# Add documents - Chroma embeds them automatically
collection.add(
    documents=[
        "Employees get 18 days of earned leave per year.",
        "Remote work is allowed on Fridays with manager approval.",
        "Health insurance covers the employee and 2 dependents."
    ],
    ids=["doc1", "doc2", "doc3"]
)
# Search by MEANING, not keywords
results = collection.query(
    query_texts=["How many vacation days do I get?"],
    n_results=1
)
print(results["documents"][0])
Note: the query said "vacation days" but the document says "earned leave", yet the vector search still found it because the two phrases have similar embeddings.
from openai import OpenAI
# Assumption: the OpenAI Python SDK, with the API key read from OPENAI_API_KEY.
llm_client = OpenAI()
def ask_rag(question):
    # 1. RETRIEVE - find relevant chunks (uses the `collection` created above)
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])
    # 2. AUGMENT - build the prompt with retrieved context
    prompt = f"""Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."
Context: {context}
Question: {question}
Answer:"""
    # 3. GENERATE - send to the LLM
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0)  # low temperature -> more deterministic output
    return response.choices[0].message.content
print(ask_rag("How many leave days do I get per year?"))
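Section 16.1 notes that RAG enables citations. A minimal sketch of how a prompt could carry source IDs is shown below, assuming the result shape that Chroma's `collection.query` returns (parallel `ids` and `documents` lists, one inner list per query). The function name `build_cited_prompt` and the hand-written `fake_results` are hypothetical, used so the example runs without a database or an LLM.

```python
def build_cited_prompt(question, results):
    """Build a prompt whose context lines are tagged with their document IDs.

    `results` is assumed to look like {"ids": [[...]], "documents": [[...]]},
    with one inner list per query.
    """
    ids = results["ids"][0]
    docs = results["documents"][0]
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in zip(ids, docs))
    return (
        "Answer using ONLY the context below and cite the source ID in brackets.\n"
        'If the answer is not in the context, say "I don\'t have that information."\n'
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hand-written stand-in for a real query result.
fake_results = {
    "ids": [["doc1", "doc2"]],
    "documents": [[
        "Employees get 18 days of earned leave per year.",
        "Remote work is allowed on Fridays with manager approval.",
    ]],
}
print(build_cited_prompt("How many leave days do I get per year?", fake_results))
```

Because each context line carries its ID, the application can map a cited ID in the model's answer back to the original document.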
RAG architecture and vector databases are core exam material.
RAG stands for:
RAG = Retrieval-Augmented Generation — retrieve relevant data, augment the prompt with it, then generate the answer.
The main problem RAG solves that an LLM alone cannot is:
LLMs are frozen in time and lack private data. RAG retrieves external documents so answers are current, grounded and citable — reducing hallucinations.
What is the correct order of the RAG pipeline?
First retrieve relevant documents, then augment the prompt with them as context, then generate the answer using that context.
Which of these is a vector database?
Chroma is a vector database, as are Pinecone and Weaviate; FAISS is a similarity-search library that fills the same role. They store embeddings and find similar items by vector comparison. MySQL is a relational DB.
Why does RAG store documents as embeddings rather than raw text?
Embeddings capture semantic meaning, so a query for "vacation days" can match a document about "earned leave" — keyword search would miss it.
In a RAG system, which component reduces hallucinations the most?
Retrieval supplies real, trusted documents as context, so the generator answers from facts rather than inventing them.
A cosine similarity of 1.0 between two text embeddings means the texts are:
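Cosine similarity ranges from -1 to 1: 1 means the vectors point the same way (for text embeddings, essentially the same meaning), 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions. In practice, scores for text embeddings are rarely strongly negative.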
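The three reference values can be checked with a few lines of plain Python. This is a small illustrative sketch using hand-picked 2-D vectors rather than real embeddings; results are rounded to hide floating-point noise.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1, 2], [2, 4]), 6))    # same direction  -> 1.0
print(round(cosine_similarity([1, 0], [0, 1]), 6))    # orthogonal      -> 0.0
print(round(cosine_similarity([1, 2], [-1, -2]), 6))  # opposite        -> -1.0
```

Note that [1, 2] and [2, 4] score 1.0 even though their lengths differ: only direction matters.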
An advantage of RAG over fine-tuning for factual knowledge is that RAG:
RAG knowledge lives in a database — update it any time with no costly retraining. Fine-tuning bakes knowledge into the weights and needs retraining to change.
Explain the three stages of RAG (Retrieve, Augment, Generate) in one sentence each.
Retrieve: the user's query is embedded and the vector database is searched for the most relevant document chunks. Augment: those retrieved chunks are inserted into the prompt as context. Generate: the LLM produces the final answer using that provided context together with the question.
Write code to create a Chroma collection, add three documents to it, and query for the most relevant one.
import chromadb
client = chromadb.Client()
collection = client.create_collection("kb")
collection.add(
    documents=["Refunds take 5 business days.",
               "The store opens at 9 AM.",
               "Shipping is free over $50."],
    ids=["d1", "d2", "d3"]
)
result = collection.query(
    query_texts=["When will I get my money back?"],
    n_results=1
)
print(result["documents"][0])
"money back" matched the refund document by meaning, even with no shared keywords.
Use SentenceTransformer to embed two sentences and compute their cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["I love machine learning",
                    "I enjoy studying AI"])
sim = cosine_similarity([emb[0]], [emb[1]])
print("Similarity:", round(sim[0][0], 3))
A high score (~0.71) confirms the two sentences are semantically close even though they share few words.
Give two reasons why RAG is often preferred over fine-tuning for a knowledge-base chatbot.
(1) Instant updates — to add or change knowledge you just edit the vector database; fine-tuning requires costly retraining. (2) Accuracy & citations — RAG grounds answers in retrieved documents, reducing hallucinations and letting you cite the exact source; fine-tuned knowledge is harder to trace and more prone to hallucination. (Also: data stays separate and controlled, which is better for privacy.)