⚡ LECTURE 16

RAG — Retrieval-Augmented Generation

Give an LLM an "open-book exam". Learn how RAG grounds models in external data, the retrieve-augment-generate pipeline, vector databases (Chroma, FAISS) and embedding models.

Syllabus topics 62–65 · ⏱ ~26 min read · 12 practice questions

In this lecture

Why RAG? Beyond Model Knowledge
How RAG Works — the Pipeline
RAG Components
Embeddings & Vector Databases
Building a RAG Pipeline
Practice Questions

16.1 Why RAG? Beyond Model Knowledge

LLMs are powerful but have real limitations:

Frozen in time: they only know information up to their training cutoff, so they cannot answer questions about more recent events.
Hallucinations: they can state wrong answers with complete confidence.
No private knowledge: they do not know your company rules or personal documents.
No sources: they cannot show where information came from, so answers are hard to verify.

RAG (Retrieval-Augmented Generation): a method in which the system first retrieves relevant information from external documents or databases, and the LLM then generates its answer using that retrieved information together with its own knowledge.

📖 The open-book exam A plain LLM is a student answering from memory — it may misremember or invent facts. RAG turns it into an open-book exam: the model is handed the relevant pages first and answers using them. Fine-tuning, by contrast, is a student who memorised a textbook — accurate for what was memorised, but needs re-studying for any new material.

What RAG fixes

Gives more accurate, up-to-date answers.
Reduces hallucinations by grounding answers in real, trusted data.
Handles specific/private topics using your own documents.
Enables citations — you can show the source of the answer.

16.2 How RAG Works — the Pipeline

Query → Embeddings → Retrieve → Rank → Generate

User Query — the user asks a question.
Embedding creation — the query is converted into a vector.
Retrieve — the system searches a vector database for the most relevant document chunks (by vector similarity).
Rank & Select — the best-matching chunks are scored and the top ones chosen.
Augment — the retrieved chunks are added to the prompt as "context".
Generate — the LLM produces the answer using the retrieved context + the query.

🔑 Three words: Retrieve → Augment → Generate Retrieve the relevant documents from your database. Augment the prompt by pasting them in as context. Generate the answer using only that provided context. That is the entire idea of RAG.
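The three stages above can be sketched end to end without any external services. This is a toy illustration under stated assumptions, not a production pipeline: `embed` is a hypothetical word-count stand-in for a real embedding model (so it matches shared words, not meaning), and `fake_llm` is a placeholder for an actual LLM call.

```python
import math
import re
from collections import Counter

DOCS = [
    "Employees get 18 days of earned leave per year.",
    "Remote work is allowed on Fridays with manager approval.",
    "Health insurance covers the employee and 2 dependents.",
]

def embed(text):
    """Toy embedding (assumption): bag-of-words counts, NOT a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """RETRIEVE: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def augment(query, chunks):
    """AUGMENT: paste the retrieved chunks into the prompt as context."""
    context = "\n".join(chunks)
    return f"Answer using ONLY the context below.\n\nContext: {context}\nQuestion: {query}\nAnswer:"

def fake_llm(prompt):
    """GENERATE placeholder (assumption): a real system would call an LLM here."""
    return f"[LLM would answer from a prompt of {len(prompt)} characters]"

def rag(query):
    chunks = retrieve(query, DOCS)
    prompt = augment(query, chunks)
    return chunks, fake_llm(prompt)

chunks, answer = rag("How many days of leave do employees get?")
print(chunks)
print(answer)
```

Swapping `embed` for a real embedding model and `fake_llm` for a real LLM client would leave the overall shape of `rag` unchanged.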

16.3 RAG Components

Component	Role
External Data	New data not in the LLM's training — from APIs, databases, document repositories
Retriever	Converts the query to a vector and finds the most relevant chunks in the vector DB
Ranker	Scores the retrieved chunks so the most relevant appear first, then adds them to the prompt
Generator (LLM)	Combines retrieved context + the query to produce the final answer

💡 Tip: why a ranker is needed even after the retriever. The retriever returns several candidate chunks, but they are not perfectly ordered. The ranker re-scores them so the most relevant chunk comes first. This matters because you may only pass the top few chunks to the LLM, and models can be sensitive to where information sits in the context.
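A two-stage retrieve-then-rerank flow might look like the sketch below. Both scoring functions are hypothetical stand-ins chosen so the example runs with the standard library only: shared-word counting plays the role of the fast vector search, and Jaccard overlap plays the role of the slower, more careful re-ranking step (in practice this second stage is often a dedicated re-ranking model).

```python
import re

def tokens(text):
    """Lower-cased set of words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def first_stage(query, chunks, k=3):
    """Cheap candidate retrieval (stand-in for vector search): count shared words."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:k]

def rerank(query, candidates, top_n=2):
    """Re-score candidates (stand-in for a re-ranking model) with Jaccard overlap,
    which penalises long chunks that match many words only incidentally."""
    q = tokens(query)

    def jaccard(chunk):
        t = tokens(chunk)
        return len(q & t) / len(q | t)

    return sorted(candidates, key=jaccard, reverse=True)[:top_n]

chunks = [
    "Handbook index: how many days, per year, of travel, expenses, payroll, badges, "
    "parking, training, audits, security and leave are covered in each chapter.",
    "Employees get 18 days of earned leave per year.",
    "The cafeteria opens at 8 AM.",
    "Sick leave requires a doctor's note after 3 days.",
]
query = "How many days of earned leave per year?"

candidates = first_stage(query, chunks)   # the long index chunk scores highest here
best = rerank(query, candidates)          # the precise policy sentence moves to the top
print(best[0])
```

The first stage is tuned for recall (do not miss the right chunk); the second is tuned for precision (put the right chunk first).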

Chunking

Large documents are split into small chunks before being embedded, so retrieval is precise: you fetch just the relevant paragraph, not a whole 50-page PDF. The index can be updated in real time (new documents are chunked and embedded immediately) or in batches (periodically).
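As a rough illustration of chunking, the sketch below splits text into fixed-size word windows. The `chunk_text` helper, the sizes, and the overlap between neighbouring chunks are assumptions for the example (overlap is a common way to avoid cutting a sentence's context in half); real pipelines often split on sentences, paragraphs or tokens instead.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into chunks of `chunk_size` words; consecutive chunks
    share `overlap` words so context is not lost at the boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# A made-up 100-word document: word0 word1 ... word99
doc = " ".join(f"word{i}" for i in range(100))
chunks = chunk_text(doc, chunk_size=40, overlap=10)
print(len(chunks))
print(chunks[1].split()[:3])
```

Each chunk would then be embedded and stored separately, so a query can pull back one passage rather than the whole document.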

16.4 Embeddings & Vector Databases

Embeddings — the key to retrieval

ML models operate on numbers, not raw text, so text is first converted into embeddings: vectors of numbers that represent meaning in a multi-dimensional space. Texts with similar meaning have vectors that are close together.

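Cosine similarity: a measure of how closely two vectors point in the same direction. Formally it ranges from -1 to 1: 1 = same direction (identical meaning), 0 = unrelated, -1 = opposite direction. Scores for text embeddings usually fall between 0 and 1, though small negative values can occur. RAG uses it to find the chunks closest to the query.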
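Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, using made-up two-dimensional vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction: close to 1
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal: 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction: close to -1
print(same, unrelated, opposite)
```

Note that the first pair differs in length but not in direction, which is why cosine similarity ignores vector magnitude.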

🧩 Why store embeddings, not raw text? "How do I return a product?" vs "What is the refund policy?" share no keywords, yet their embeddings have a cosine similarity of ~0.65 (very similar). "What time does the store open?" scores only ~0.11. Embeddings let RAG match by meaning, which keyword search cannot do.

Vector Databases

Vector Database — a database that stores embeddings and finds similar items quickly by comparing vectors. Examples: Chroma, FAISS, Pinecone, Weaviate.

Vector DB	Notes
Chroma	Lightweight, open-source, runs locally with no server — great for development & learning
FAISS	Facebook AI Similarity Search — a high-performance library for fast similarity search at scale
Pinecone / Weaviate	Managed/cloud vector databases for production
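Conceptually, a vector database stores (id, vector, text) records and returns the nearest ones to a query vector. The `TinyVectorStore` class below is a hypothetical brute-force sketch of that idea, with hand-made 3-dimensional vectors standing in for real embeddings; production systems typically use specialised indexes to avoid comparing against every stored vector.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: brute-force cosine search over all vectors."""

    def __init__(self):
        self._items = []  # list of (id, vector, text)

    def add(self, item_id, vector, text):
        self._items.append((item_id, vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, n_results=1):
        """Return the n_results most similar items as (score, id, text) tuples."""
        scored = [(self._cosine(vector, v), item_id, text)
                  for item_id, v, text in self._items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:n_results]

store = TinyVectorStore()
# Made-up vectors; a real embedding model would produce these.
store.add("doc1", [0.9, 0.1, 0.0], "leave policy")
store.add("doc2", [0.1, 0.9, 0.0], "remote work")
store.add("doc3", [0.0, 0.2, 0.9], "health insurance")

results = store.query([0.8, 0.2, 0.1], n_results=2)
for score, item_id, text in results:
    print(item_id, text, round(score, 3))
```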

Embedding models

An embedding model is the neural network that converts text into vectors. Common choices: all-MiniLM-L6-v2 (a small, free, fast open model producing 384-dimensional vectors) or commercial APIs like OpenAI's text-embedding-3-small. The same embedding model must be used for both the documents and the query.

16.5 Building a RAG Pipeline

Python · cosine similarity between sentences

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I return a product?",
             "What is the refund policy?",     # similar meaning
             "What time does the store open?"] # different topic
emb = model.encode(sentences)

sim = cosine_similarity(emb)
print("A vs B:", round(sim[0][1], 3))   # similar
print("A vs C:", round(sim[0][2], 3))   # different

Output
A vs B: 0.645
A vs C: 0.113

Python · a vector database with Chroma

import chromadb

client = chromadb.Client()
collection = client.create_collection("company_docs")

# Add documents - Chroma embeds them automatically
collection.add(
    documents=[
        "Employees get 18 days of earned leave per year.",
        "Remote work is allowed on Fridays with manager approval.",
        "Health insurance covers the employee and 2 dependents."
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Search by MEANING, not keywords
results = collection.query(
    query_texts=["How many vacation days do I get?"],
    n_results=1
)
print(results["documents"][0])

Output
['Employees get 18 days of earned leave per year.']

Note: the query said "vacation days" but the document says "earned leave", yet the search still found it because the two phrases have similar embeddings.

Python · the complete RAG function

from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set.
llm_client = OpenAI()

def ask_rag(question):
    # 1. RETRIEVE - find relevant chunks
    # (uses the Chroma `collection` created in the previous example)
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # 2. AUGMENT - build the prompt with retrieved context
    prompt = f"""Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."

Context: {context}
Question: {question}
Answer:"""

    # 3. GENERATE - send to the LLM
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0)              # low temperature -> more deterministic output
    return response.choices[0].message.content

print(ask_rag("How many leave days do I get per year?"))

Output
You get 18 days of earned leave per calendar year.
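Because RAG enables citations, one possible extension is to number the retrieved chunks in the prompt and ask the model to cite them. The `build_cited_prompt` helper below is a hypothetical sketch, and the `results` dictionary is a hand-written stand-in shaped like a retrieval result (parallel lists of ids and documents), not real query output.

```python
def build_cited_prompt(question, ids, documents):
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    lines = [f"[{n}] ({doc_id}) {text}"
             for n, (doc_id, text) in enumerate(zip(ids, documents), start=1)]
    context = "\n".join(lines)
    return (
        "Answer using ONLY the numbered context below and cite the source "
        "number in square brackets.\n"
        'If the answer is not in the context, say "I don\'t have that information."\n\n'
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hand-written stand-in for a retrieval result (assumption, not real output).
results = {
    "ids": [["doc1", "doc2"]],
    "documents": [[
        "Employees get 18 days of earned leave per year.",
        "Remote work is allowed on Fridays with manager approval.",
    ]],
}

prompt = build_cited_prompt(
    "How many leave days do I get per year?",
    results["ids"][0],
    results["documents"][0],
)
print(prompt)
```

Keeping the document ids next to the numbers lets the application map a cited `[1]` in the answer back to the original source.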

🔑 RAG vs Fine-Tuning RAG = a student who looks things up in books — flexible, cheap to update (just edit the database), great for dynamic/factual knowledge. Fine-Tuning = a student who memorised the textbook — needs re-training for new data, more expensive. For factual, changing knowledge, RAG wins.

Practice Questions

RAG architecture and vector databases are core exam material.

MCQ · Q1 · Definition

RAG stands for:

A Random Answer Generation
B Retrieval-Augmented Generation
C Recursive Algorithmic Grouping
D Rapid AI Gateway

Answer: B

RAG = Retrieval-Augmented Generation — retrieve relevant data, augment the prompt with it, then generate the answer.

MCQ · Q2 · Purpose

The main problem RAG solves that an LLM alone cannot is:

A Making the model run faster
B Grounding answers in up-to-date, private or trusted external data
C Removing the need for a prompt
D Translating between languages

Answer: B

LLMs are frozen in time and lack private data. RAG retrieves external documents so answers are current, grounded and citable — reducing hallucinations.

MCQ · Q3 · Pipeline

What is the correct order of the RAG pipeline?

A Generate → Retrieve → Augment
B Retrieve → Augment → Generate
C Augment → Generate → Retrieve
D Generate → Augment → Retrieve

Answer: B

First retrieve relevant documents, then augment the prompt with them as context, then generate the answer using that context.

MCQ · Q4 · Vector DB

Which of these is a vector database?

A MySQL
B Chroma
C Pandas
D NumPy

Answer: B

Chroma (and FAISS, Pinecone, Weaviate) are vector databases that store embeddings and find similar items by vector comparison. MySQL is a relational DB.

MCQ · Q5 · Embeddings

Why does RAG store documents as embeddings rather than raw text?

A Embeddings take more memory
B Embeddings allow searching by meaning, finding relevant text even without shared keywords
C Raw text cannot be stored on disk
D Embeddings translate the text

Answer: B

Embeddings capture semantic meaning, so a query for "vacation days" can match a document about "earned leave" — keyword search would miss it.

MCQ · Q6 · Components

In a RAG system, which component reduces hallucinations the most?

A Retrieval — it grounds the answer in real documents
B The temperature parameter
C The tokenizer
D The number of GPUs

Answer: A

Retrieval supplies real, trusted documents as context, so the generator answers from facts rather than inventing them.

MCQ · Q7 · Cosine similarity

A cosine similarity of 1.0 between two text embeddings means the texts are:

A Completely unrelated
B Identical / extremely similar in meaning
C In different languages
D Both empty

Answer: B

Cosine similarity formally ranges from -1 to 1, though scores for text embeddings usually fall between 0 and 1: 1 means the vectors point the same way (identical meaning), 0 means unrelated.

MCQ · Q8 · RAG vs Fine-tuning

An advantage of RAG over fine-tuning for factual knowledge is that RAG:

A Permanently changes the model's weights
B Can be updated instantly by editing the database, with no retraining
C Never needs any documents
D Works only for images

Answer: B

RAG knowledge lives in a database — update it any time with no costly retraining. Fine-tuning bakes knowledge into the weights and needs retraining to change.

Short Answer · Q9 · Concept

Explain the three stages of RAG (Retrieve, Augment, Generate) in one sentence each.

Model answer

Retrieve: the user's query is embedded and the vector database is searched for the most relevant document chunks. Augment: those retrieved chunks are inserted into the prompt as context. Generate: the LLM produces the final answer using that provided context together with the question.

Coding · Q10 · Chroma DB

Write code to create a Chroma collection, add three documents to it, and query for the most relevant one.

Solution

Python

import chromadb

client = chromadb.Client()
collection = client.create_collection("kb")

collection.add(
    documents=["Refunds take 5 business days.",
               "The store opens at 9 AM.",
               "Shipping is free over $50."],
    ids=["d1", "d2", "d3"]
)

result = collection.query(
    query_texts=["When will I get my money back?"],
    n_results=1
)
print(result["documents"][0])

Output
['Refunds take 5 business days.']

"money back" matched the refund document by meaning, even with no shared keywords.

Coding · Q11 · Embeddings

Use SentenceTransformer to embed two sentences and compute their cosine similarity.

Solution

Python

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["I love machine learning",
                    "I enjoy studying AI"])

sim = cosine_similarity([emb[0]], [emb[1]])
print("Similarity:", round(sim[0][0], 3))

Output
Similarity: 0.71

A high score (~0.71) confirms the two sentences are semantically close even though they share few words.

Short Answer · Q12 · RAG vs Fine-tune

Give two reasons why RAG is often preferred over fine-tuning for a knowledge-base chatbot.

Model answer

(1) Instant updates — to add or change knowledge you just edit the vector database; fine-tuning requires costly retraining. (2) Accuracy & citations — RAG grounds answers in retrieved documents, reducing hallucinations and letting you cite the exact source; fine-tuned knowledge is harder to trace and more prone to hallucination. (Also: data stays separate and controlled, which is better for privacy.)

🎯 Lecture 16 — must-remember RAG = Retrieve → Augment → Generate; gives the LLM an "open-book exam". Pipeline: Query → Embeddings → Retrieve → Rank → Generate. Components: External data, Retriever, Ranker, Generator. Embeddings capture meaning; cosine similarity 1=identical, 0=unrelated. Vector DBs: Chroma (lightweight, local), FAISS (fast similarity search). RAG beats fine-tuning for changing/factual knowledge.
