GenAI Exam Prep
⚡ LECTURE 16

RAG — Retrieval-Augmented Generation

Give an LLM an "open-book exam". Learn how RAG grounds models in external data, the retrieve-augment-generate pipeline, vector databases (Chroma, FAISS) and embedding models.

Syllabus topics 62–65 ⏱ ~26 min read 12 practice questions

16.1 Why RAG? Beyond Model Knowledge

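LLMs are powerful but have real limitations: their knowledge is frozen at training time, they cannot see private data, and they may invent facts (hallucinate) when answering from memory alone.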

RAG (Retrieval-Augmented Generation) is a method in which the system first retrieves relevant information from external documents or databases, and the LLM then generates its answer using that retrieved information plus its own knowledge.
📖 The open-book exam
A plain LLM is a student answering from memory: it may misremember or invent facts. RAG turns it into an open-book exam: the model is handed the relevant pages first and answers using them. Fine-tuning, by contrast, is a student who memorised a textbook: accurate for what was memorised, but in need of re-studying for any new material.

What RAG fixes

16.2 How RAG Works — the Pipeline

Query → Embeddings → Retrieve → Rank → Generate
  1. User Query — the user asks a question.
  2. Embedding creation — the query is converted into a vector.
  3. Retrieve — the system searches a vector database for the most relevant document chunks (by vector similarity).
  4. Rank & Select — the best-matching chunks are scored and the top ones chosen.
  5. Augment — the retrieved chunks are added to the prompt as "context".
  6. Generate — the LLM produces the answer using the retrieved context + the query.
🔑 Three words: Retrieve → Augment → Generate
Retrieve the relevant documents from your database. Augment the prompt by pasting them in as context. Generate the answer using only that provided context. That is the entire idea of RAG.
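
The six steps above can be sketched end to end without any external library. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, the `retrieve` and `augment` helpers are hypothetical names, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

# Toy stand-in for an embedding model (assumption): a bag-of-words count vector.
# A real system would use a neural embedding model instead.
def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    q = embed(query)                                      # step 2: embed the query
    scored = [(cosine(q, embed(c)), c) for c in chunks]   # step 3: similarity search
    scored.sort(key=lambda pair: pair[0], reverse=True)   # step 4: rank ...
    return [c for _, c in scored[:k]]                     # ... and select the top k

def augment(query, context_chunks):
    context = "\n".join(context_chunks)                   # step 5: add chunks as context
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "Employees get 18 days of earned leave per year.",
    "Remote work is allowed on Fridays with manager approval.",
    "Health insurance covers the employee and 2 dependents.",
]
question = "How many days of leave do employees get?"
top = retrieve(question, chunks, k=1)
prompt = augment(question, top)
print(prompt)  # step 6 would send this prompt to an LLM
```

Because the toy `embed` only counts shared words, it cannot match by meaning the way a real embedding model does; it is here only to make the order of the steps concrete.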

16.3 RAG Components

Component | Role
External Data | New data not in the LLM's training data, drawn from APIs, databases, and document repositories
Retriever | Converts the query to a vector and finds the most relevant chunks in the vector DB
Ranker | Scores the retrieved chunks so the most relevant appear first, then adds them to the prompt
Generator (LLM) | Combines retrieved context + the query to produce the final answer
💡 Tip: why a ranker is needed even after the retriever
The retriever returns several candidate chunks, but they are not perfectly ordered. The ranker re-scores them so the most relevant chunk comes first. This matters because you may only pass the top few chunks to the LLM, and models often make better use of context placed near the start of the prompt.
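
A minimal sketch of the re-ranking step, assuming the retriever has already returned a handful of candidates. The `overlap_score` function is a hypothetical keyword-overlap scorer used purely for illustration; production systems typically use a dedicated re-ranking model here.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, chunk):
    """Hypothetical re-ranking score: fraction of query words found in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0

def rerank(query, candidates, top_n=2):
    """Re-score retrieved candidates and keep the top_n, best first."""
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_n]

# Candidates as a retriever might return them: relevant, but loosely ordered.
candidates = [
    "Remote work is allowed on Fridays with manager approval.",
    "Health insurance covers the employee and 2 dependents.",
    "Employees get 18 days of earned leave per year.",
]
best = rerank("How many days of earned leave per year?", candidates, top_n=2)
print(best[0])  # the leave-policy chunk now comes first
```

Only `best` would be pasted into the prompt, so the chunk that answers the question leads the context.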

Chunking

Large documents are split into small chunks before being embedded, so retrieval is precise — you fetch just the relevant paragraph, not a whole 50-page PDF. Updates can be real-time (added immediately) or batch (periodic).
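
Chunking can be sketched as a simple word-window splitter. The chunk size, the overlap between neighbouring chunks, and the function name `chunk_words` are illustrative assumptions rather than values from this lecture; overlap is a common way to avoid cutting a sentence's context at a chunk boundary.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words (an illustrative choice).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# A synthetic 120-word "document" for demonstration.
document = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(document, chunk_size=50, overlap=10)
print(len(chunks))  # 3
```

Each chunk would then be embedded and stored separately, so a query can pull back just the matching passage.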

16.4 Embeddings & Vector Databases

Embeddings — the key to retrieval

ML models cannot understand raw text, so everything is converted into embeddings — vectors of numbers that represent meaning in a multi-dimensional space. Texts with similar meaning have vectors that are close together.

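Cosine similarity: a measure of how closely two vectors point in the same direction. It ranges from -1 to 1: 1 means the same direction (identical meaning) and 0 means unrelated; negative values mean opposing directions, although scores for text embeddings mostly fall between 0 and 1. RAG uses it to find the chunks closest to the query.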
🧩 Why store embeddings, not raw text?
"How do I return a product?" vs "What is the refund policy?" share no keywords, yet their embeddings have a cosine similarity of ~0.65 (very similar). "What time does the store open?" scores only ~0.11. Embeddings let RAG match by meaning, which keyword search cannot do.
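
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors (not real embeddings) to show the boundary cases; the function name `cosine_sim` is just an illustrative choice.

```python
import math

def cosine_sim(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])        # opposite directions
print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Note that the score depends only on direction, not length: `[1, 2]` and `[2, 4]` count as identical.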

Vector Databases

Vector Database — a database that stores embeddings and finds similar items quickly by comparing vectors. Examples: Chroma, FAISS, Pinecone, Weaviate.
Vector DB | Notes
Chroma | Lightweight, open-source, runs locally with no server; great for development and learning
FAISS | Facebook AI Similarity Search: a high-performance library for fast similarity search at scale
Pinecone / Weaviate | Managed/cloud vector databases for production
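
At its core, a vector database stores vectors alongside their text and returns the nearest ones to a query vector. The sketch below shows that idea with a brute-force search; `TinyVectorStore` is a hypothetical class, and the hand-made 3-dimensional vectors stand in for real embeddings. Real vector databases add indexing so that search stays fast over very large collections.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: brute-force cosine search over all vectors."""

    def __init__(self):
        self.items = []  # list of (doc_id, vector, text)

    def add(self, doc_id, vector, text):
        self.items.append((doc_id, vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def query(self, vector, n_results=1):
        # Compare the query against every stored vector, most similar first.
        scored = sorted(self.items,
                        key=lambda item: self._cosine(vector, item[1]),
                        reverse=True)
        return [(doc_id, text) for doc_id, _, text in scored[:n_results]]

store = TinyVectorStore()
# Hand-made 3-dimensional vectors stand in for real embeddings (assumption).
store.add("d1", [0.9, 0.1, 0.0], "Refunds take 5 business days.")
store.add("d2", [0.0, 0.9, 0.1], "The store opens at 9 AM.")
store.add("d3", [0.1, 0.0, 0.9], "Shipping is free over $50.")

hits = store.query([0.8, 0.2, 0.1], n_results=1)
print(hits)
```

The query vector points mostly along the first dimension, so the store returns `d1`, the document whose vector points the same way.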

Embedding models

An embedding model is the neural network that converts text into vectors. Common choices: all-MiniLM-L6-v2 (a small, free, fast open model producing 384-dimensional vectors) or commercial APIs like OpenAI's text-embedding-3-small. The same embedding model must be used for both the documents and the query.

16.5 Building a RAG Pipeline

Python · cosine similarity between sentences
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I return a product?",
             "What is the refund policy?",     # similar meaning
             "What time does the store open?"] # different topic
emb = model.encode(sentences)

sim = cosine_similarity(emb)
print("A vs B:", round(sim[0][1], 3))   # similar
print("A vs C:", round(sim[0][2], 3))   # different
Output
A vs B: 0.645
A vs C: 0.113
Python · a vector database with Chroma
import chromadb

client = chromadb.Client()
collection = client.create_collection("company_docs")

# Add documents - Chroma embeds them automatically
collection.add(
    documents=[
        "Employees get 18 days of earned leave per year.",
        "Remote work is allowed on Fridays with manager approval.",
        "Health insurance covers the employee and 2 dependents."
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Search by MEANING, not keywords
results = collection.query(
    query_texts=["How many vacation days do I get?"],
    n_results=1
)
print(results["documents"][0])
Output
['Employees get 18 days of earned leave per year.']

Note: the query said "vacation days" but the document says "earned leave" — yet RAG found it, because they have similar embeddings.

Python · the complete RAG function
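from openai import OpenAI

# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
llm_client = OpenAI()

# Uses the Chroma `collection` created in the previous example.
def ask_rag(question):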
    # 1. RETRIEVE - find relevant chunks
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # 2. AUGMENT - build the prompt with retrieved context
    prompt = f"""Answer using ONLY the context below.
If the answer is not in the context, say "I don't have that information."

Context: {context}
Question: {question}
Answer:"""

    # 3. GENERATE - send to the LLM
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0)              # low temperature -> more deterministic, less creative output
    return response.choices[0].message.content

print(ask_rag("How many leave days do I get per year?"))
Output (an LLM's exact wording may vary)
You get 18 days of earned leave per calendar year.
🔑 RAG vs Fine-Tuning
RAG = a student who looks things up in books: flexible, cheap to update (just edit the database), great for dynamic/factual knowledge. Fine-Tuning = a student who memorised the textbook: needs re-training for new data, more expensive. For factual, changing knowledge, RAG wins.
Practice Questions

RAG architecture and vector databases are core exam material.

MCQ · Q1 · Definition

RAG stands for:

  • A. Random Answer Generation
  • B. Retrieval-Augmented Generation
  • C. Recursive Algorithmic Grouping
  • D. Rapid AI Gateway
Answer: B

RAG = Retrieval-Augmented Generation — retrieve relevant data, augment the prompt with it, then generate the answer.

MCQ · Q2 · Purpose

The main problem RAG solves that an LLM alone cannot is:

  • A. Making the model run faster
  • B. Grounding answers in up-to-date, private or trusted external data
  • C. Removing the need for a prompt
  • D. Translating between languages
Answer: B

LLMs are frozen in time and lack private data. RAG retrieves external documents so answers are current, grounded and citable — reducing hallucinations.

MCQ · Q3 · Pipeline

What is the correct order of the RAG pipeline?

  • A. Generate → Retrieve → Augment
  • B. Retrieve → Augment → Generate
  • C. Augment → Generate → Retrieve
  • D. Generate → Augment → Retrieve
Answer: B

First retrieve relevant documents, then augment the prompt with them as context, then generate the answer using that context.

MCQ · Q4 · Vector DB

Which of these is a vector database?

  • A. MySQL
  • B. Chroma
  • C. Pandas
  • D. NumPy
Answer: B

Chroma (and FAISS, Pinecone, Weaviate) are vector databases that store embeddings and find similar items by vector comparison. MySQL is a relational DB.

MCQ · Q5 · Embeddings

Why does RAG store documents as embeddings rather than raw text?

  • A. Embeddings take more memory
  • B. Embeddings allow searching by meaning, finding relevant text even without shared keywords
  • C. Raw text cannot be stored on disk
  • D. Embeddings translate the text
Answer: B

Embeddings capture semantic meaning, so a query for "vacation days" can match a document about "earned leave" — keyword search would miss it.

MCQ · Q6 · Components

In a RAG system, which component reduces hallucinations the most?

  • A. Retrieval: it grounds the answer in real documents
  • B. The temperature parameter
  • C. The tokenizer
  • D. The number of GPUs
Answer: A

Retrieval supplies real, trusted documents as context, so the generator answers from facts rather than inventing them.

MCQ · Q7 · Cosine similarity

A cosine similarity of 1.0 between two text embeddings means the texts are:

  • A. Completely unrelated
  • B. Identical / extremely similar in meaning
  • C. In different languages
  • D. Both empty
Answer: B

Cosine similarity ranges from -1 to 1 (text embeddings mostly score between 0 and 1): 1 means the vectors point the same way (identical meaning), 0 means unrelated.

MCQ · Q8 · RAG vs Fine-tuning

An advantage of RAG over fine-tuning for factual knowledge is that RAG:

  • A. Permanently changes the model's weights
  • B. Can be updated instantly by editing the database, with no retraining
  • C. Never needs any documents
  • D. Works only for images
Answer: B

RAG knowledge lives in a database — update it any time with no costly retraining. Fine-tuning bakes knowledge into the weights and needs retraining to change.

Short Answer · Q9 · Concept

Explain the three stages of RAG (Retrieve, Augment, Generate) in one sentence each.

Model answer

Retrieve: the user's query is embedded and the vector database is searched for the most relevant document chunks. Augment: those retrieved chunks are inserted into the prompt as context. Generate: the LLM produces the final answer using that provided context together with the question.

Coding · Q10 · Chroma DB

Write code to create a Chroma collection, add three documents to it, and query for the most relevant one.

Solution
Python
import chromadb

client = chromadb.Client()
collection = client.create_collection("kb")

collection.add(
    documents=["Refunds take 5 business days.",
               "The store opens at 9 AM.",
               "Shipping is free over $50."],
    ids=["d1", "d2", "d3"]
)

result = collection.query(
    query_texts=["When will I get my money back?"],
    n_results=1
)
print(result["documents"][0])
Output
['Refunds take 5 business days.']

"money back" matched the refund document by meaning, even with no shared keywords.

Coding · Q11 · Embeddings

Use SentenceTransformer to embed two sentences and compute their cosine similarity.

Solution
Python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["I love machine learning",
                    "I enjoy studying AI"])

sim = cosine_similarity([emb[0]], [emb[1]])
print("Similarity:", round(sim[0][0], 3))
Output
Similarity: 0.71

A high score (~0.71) confirms the two sentences are semantically close even though they share few words.

Short Answer · Q12 · RAG vs Fine-tune

Give two reasons why RAG is often preferred over fine-tuning for a knowledge-base chatbot.

Model answer

(1) Instant updates — to add or change knowledge you just edit the vector database; fine-tuning requires costly retraining. (2) Accuracy & citations — RAG grounds answers in retrieved documents, reducing hallucinations and letting you cite the exact source; fine-tuned knowledge is harder to trace and more prone to hallucination. (Also: data stays separate and controlled, which is better for privacy.)

🎯 Lecture 16: must-remember
  • RAG = Retrieve → Augment → Generate; gives the LLM an "open-book exam".
  • Pipeline: Query → Embeddings → Retrieve → Rank → Generate.
  • Components: External data, Retriever, Ranker, Generator.
  • Embeddings capture meaning; cosine similarity 1=identical, 0=unrelated.
  • Vector DBs: Chroma (lightweight, local), FAISS (fast similarity search).
  • RAG beats fine-tuning for changing/factual knowledge.