GenAI Exam Prep
⚑ LECTURE 10

LLM Architecture

How ChatGPT actually works. Learn the Transformer architecture, the self-attention mechanism (Q, K, V) that powers it, and the ethics of building responsible AI.

Syllabus topics 35–38 ⏱ ~28 min read 13 practice questions

10.1 What is a Large Language Model?

Large Language Model (LLM): a program trained on massive amounts of text to predict the next word (token) in a sequence. "Large" refers to the billions of parameters (adjustable numbers) it uses to recognise language patterns.
🔑 It feels like magic, but it is not. At its heart an LLM is a sophisticated next-word prediction engine. It does not "understand" meaning; it recognises statistical patterns from its training text and calculates the most probable next word, then the next, and so on. Your phone's keyboard does a tiny version of this; ChatGPT does it at enormous scale.
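To make "next-word prediction" concrete, here is a minimal sketch of a word-pair (bigram) counter. It is a toy, not how an LLM is built: the corpus is made up, and the helper names `next_word_probabilities` and `predict_next` are hypothetical. It only illustrates the idea of turning observed patterns into probabilities for the next word.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
tokens = corpus.split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    following[current_word][next_word] += 1

def next_word_probabilities(word):
    """Return {candidate: probability} for the word that follows `word`."""
    counts = following[word]
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}

def predict_next(word):
    """Pick the single most probable next word."""
    return following[word].most_common(1)[0][0]

probs = next_word_probabilities("the")
print(probs)                # "cat" has the highest probability (2 of 6 occurrences)
print(predict_next("the"))  # cat
```

A real LLM replaces the counting table with billions of learned parameters and conditions on the whole preceding context, not just the previous word.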

The problem with old (sequential) models

Early models (RNNs) read text one word at a time, like reading a novel through a straw. In a long sentence such as "The cat, which had been sleeping peacefully on the warm windowsill all afternoon, suddenly jumped", by the time the model reaches "jumped" it has largely forgotten "the cat". This is the problem of context loss.

10.2 The Transformer Architecture

Transformer: the architecture introduced in 2017 ("Attention Is All You Need") that processes the entire sentence at once instead of word-by-word. It can weigh the importance of every word relative to every other word, no matter how far apart.
| | Old: RNN (sequential) | New: Transformer (parallel) |
| --- | --- | --- |
| Reading style | One word at a time | All words at once |
| Analogy | Reading through a narrow straw | Seeing the whole picture instantly |
| Problem | Forgets the start by the end | Full context understood |
| Speed | Slow (cannot parallelise) | Fast (parallel on GPUs) |

The Transformer enabled ChatGPT, GPT-4, BERT and virtually every modern LLM.

Encoder and Decoder

Inside the blocks

| Encoder block | Decoder block |
| --- | --- |
| Self-Attention: who relates to whom | Masked Attention: same, but cannot "peek" at future words |
| Add & Norm: stabilise / "reality check" | Cross-Attention: asks the Encoder for help |
| Feed Forward: process the meaning | Feed Forward + Next-Word Prediction |
💡 Tip: why "Masked" attention in the decoder? When generating word #5, the decoder must not see words #6, #7… that it has not written yet. Masked attention hides ("masks") future words so the model only attends to what it has already produced, like a test where future answers are covered.
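The masking idea can be sketched in a few lines of numpy. This is an illustrative sketch, not a full decoder: the scores are random placeholders, and setting future positions to negative infinity before softmax is one common way to implement the mask.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for stability
    exp = np.exp(x)
    return exp / exp.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # hypothetical raw attention scores

# True above the diagonal marks "future" positions that must stay hidden.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# -inf becomes exactly 0 after softmax, so future words get no attention.
masked_scores = np.where(future, -np.inf, scores)
weights = softmax(masked_scores)

print(weights.round(2))  # upper triangle is all zeros; each row still sums to 1
```

Row i of `weights` is the attention pattern for word i: it only has non-zero entries for positions 0 to i.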

10.3 The Self-Attention Mechanism

Self-Attention: the mechanism that lets the model give more weight (attention) to the words most relevant for understanding any given word, building a network of relationships between all words.
🧩 The pronoun problem: "The trophy doesn't fit in the suitcase because it is too large." What does "it" refer to: the trophy or the suitcase? Humans instantly know it's the trophy. Self-attention lets the model "attend to" earlier words and assign "trophy" a high attention score when processing "it".

Query, Key, Value

Every word produces three vectors via learned weight matrices:

| Vector | Meaning | Analogy |
| --- | --- | --- |
| Q (Query) | What I'm looking for | "What information do I need?" |
| K (Key) | What I offer | "Here's what I can tell you about" |
| V (Value) | What I actually say | The real information content passed along |
| dₖ | Dimension of K | A scaling factor for numerical stability |
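As a rough sketch of "three vectors via learned weight matrices": each word's embedding is multiplied by three separate matrices. The sizes and the names `X`, `W_q`, `W_k`, `W_v` are assumptions for illustration, and the matrices here are random placeholders rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                  # hypothetical sizes, chosen for readability
X = rng.normal(size=(3, d_model))    # stand-in embeddings for a 3-word sentence

# In a trained model these matrices are learned; here they are random placeholders.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # one Query row per word: "what I'm looking for"
K = X @ W_k   # one Key row per word:   "what I offer"
V = X @ W_v   # one Value row per word: "what I actually say"

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```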

The attention formula

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The four steps

  1. Score: take the dot product of a word's Query with every word's Key. A high dot product = high relevance.
  2. Scale: divide the scores by √dₖ to keep the numbers stable during training.
  3. Softmax: convert the scaled scores into attention weights (positive, summing to 1), i.e. percentages of attention.
  4. Weighted sum: the new representation of the word is the weighted sum of all Value vectors, using those attention weights.
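The four steps map almost line for line onto numpy. The sketch below is a single-head version with no masking, using random placeholder values for Q, K and V; the function name `scaled_dot_product_attention` is chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for stability
    exp = np.exp(x)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                     # 1. Score: every Query against every Key
    scaled = scores / np.sqrt(d_k)       # 2. Scale: divide by sqrt(d_k)
    weights = softmax(scaled, axis=-1)   # 3. Softmax: each row sums to 1
    output = weights @ V                 # 4. Weighted sum of the Value vectors
    return output, weights

# Placeholder inputs: 3 words, d_k = 4.
rng = np.random.default_rng(1)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # 3x3 matrix: row i = how word i spreads its attention
print(output.shape)       # (3, 4): one new vector per word
```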
🧩 Worked example (from the worksheet): Sentence "Transformers are amazing". For "Transformers", scores against [Transformers, are, amazing] = [5, 1, 6]. Softmax → roughly [26.8%, 0.5%, 72.7%]. So "Transformers" pays about 73% attention to "amazing". Final context vector ≈ 0.268·V(Transformers) + 0.005·V(are) + 0.727·V(amazing), a brand-new vector that now "knows" Transformers are amazing.
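The worksheet numbers can be checked directly. In this sketch the scores [5, 1, 6] come from the example, softmax is applied to them as given (as the worksheet does), and the Value vectors in `V` are made-up 2-dimensional placeholders, since the worksheet does not specify them.

```python
import numpy as np

scores = np.array([5.0, 1.0, 6.0])   # Transformers vs [Transformers, are, amazing]

# Softmax: exponentiate (after subtracting the max for stability) and normalise.
exp = np.exp(scores - scores.max())
weights = exp / exp.sum()            # about 0.268, 0.005, 0.727

# Hypothetical Value vectors, one row per word (not from the worksheet).
V = np.array([
    [1.0, 0.0],   # V(Transformers)
    [0.0, 1.0],   # V(are)
    [1.0, 1.0],   # V(amazing)
])

# Weighted sum: every word contributes, including the small share from "are".
context = weights @ V
print(weights.round(3))
print(context.round(3))
```

Whatever the real Value vectors are, the context vector is dominated by V(amazing), because it carries roughly 73% of the weight.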

Why self-attention is revolutionary

10.4 Tokenization & Positional Encoding

Tokenization (text → numbers)

Each token (a word or piece of a word) is converted to an ID (an index, like a menu item number), then to a vector (the embedding that carries its meaning). The computer needs the simple ID to look up the vector used in the maths.
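A minimal sketch of the text → ID → vector pipeline follows. The four-entry vocabulary, the whitespace `tokenize` helper and the random `embedding_table` are all hypothetical; real tokenizers split text into sub-word pieces and real embeddings are learned during training.

```python
import numpy as np

# Hypothetical vocabulary: token -> ID (the "menu item number").
vocab = {"<unk>": 0, "dog": 1, "bites": 2, "man": 3}

def tokenize(text):
    """Split on whitespace and map each word to its ID (unknown words -> 0)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# One row per ID. Random placeholders here; learned in a real model.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

ids = tokenize("Dog bites Man")
vectors = embedding_table[ids]   # the ID is just a row lookup

print(ids)             # [1, 2, 3]
print(vectors.shape)   # (3, 4)
```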

Positional Encoding

🔑 Why positional encoding is essential: Because the Transformer reads all words at the same time, it has no built-in sense of order, so "Dog bites Man" and "Man bites Dog" would look identical. Positional Encoding adds a "position tag" to each word's vector so the model knows the sequence order.
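One way to build the "position tag" is the sinusoidal scheme from the original Transformer paper, sketched below. The lecture does not say which scheme it has in mind, so treat this choice as an assumption; the one-hot embeddings are also made up for the demo, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position tags; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Made-up embeddings, just to have three distinct word vectors.
emb = {
    "dog":   np.array([1.0, 0.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0, 0.0]),
}

a = np.stack([emb[w] for w in ["dog", "bites", "man"]])
b = np.stack([emb[w] for w in ["man", "bites", "dog"]])
pe = sinusoidal_positions(seq_len=3, d_model=4)

# Without position tags, "dog" has the same vector in both sentences.
print(np.allclose(a[0], b[2]))                    # True
# With position tags, "dog" at position 0 differs from "dog" at position 2.
print(np.allclose(a[0] + pe[0], b[2] + pe[2]))    # False
```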

10.5 Ethics & Responsible AI

LLMs are trained on a vast snapshot of the public internet, so they learn human knowledge and human biases, errors and misinformation. The model cannot tell a peer-reviewed paper from a conspiracy blog; it treats all text as equally valid.

Failure Mode 1: Bias

AI Bias: when a model's outputs create or reinforce unfair prejudice. Bias ≠ malice: the AI is not evil; it simply repeats patterns from historical training data.

Example: a résumé-screening AI learned to reject female candidates because historical hiring data was mostly men. If training data shows most CEOs are men, the model may complete "The CEO finished his…" with gendered language.

Failure Mode 2: Hallucinations

Hallucination: when an LLM confidently states false information. LLMs are designed for fluency, not truth: they assemble the most probable sequence of words, with no built-in fact-checking.

Real case: a lawyer used ChatGPT to research legal cases; it confidently cited six cases, all completely fabricated, with made-up names and judges, and he submitted them to court before discovering they didn't exist. Never trust an LLM blindly for medical, legal or financial advice. LLMs are probabilistic, not deterministic.

Preventing misinformation

The four principles of Responsible AI

PrincipleMeaningAnalogy
FairnessThe system should not create or reinforce unfair biasA fair referee applying the same rules to everyone
TransparencyWe should understand how the system makes decisionsShowing your work on a maths problem
AccountabilityPeople are responsible for AI outcomesA captain responsible for their ship
Safety & PrivacySystems must be reliable, secure and minimise harmSeatbelts in cars
💡 Tip: reducing hallucinations in practice. Instruct the model to admit ignorance: "Answer only using the provided text", "If you cannot find the answer, say 'I don't know'." Setting temperature to 0 also makes outputs more deterministic (the model always picks its top-ranked token), which reduces random drift but does not by itself guarantee factual accuracy. (RAG, Lecture 16, grounds answers in real documents.)
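To see what the temperature setting does, here is a small sketch. The logits are made-up scores for three candidate tokens, and treating temperature 0 as "always pick the top token" is an assumption about how the limit is commonly handled, not a description of any specific product's API.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature = sharper choice."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Limiting case: all probability on the top-scoring token (greedy).
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()          # subtract max for stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]            # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0):
    print(t, softmax_with_temperature(logits, t).round(3))
```

As the temperature falls, probability concentrates on the top-scoring token, so repeated runs give the same answer. That answer is only as correct as the model's top-ranked token, which is why grounding instructions still matter.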
Practice Questions

Self-attention and responsible AI are core exam topics.

MCQ · Q1 · LLM basics

At its core, an LLM works mostly by:

  • A Having feelings and consciousness
  • B Predicting the most likely next word/token using statistical patterns
  • C Searching the entire internet for each question
  • D Storing every sentence it has ever read word-for-word
Answer: B

An LLM is a next-word prediction engine. It calculates probabilities over possible next tokens from patterns learned in training; it does not understand or browse live.

MCQ · Q2 · Transformer

The key innovation of the Transformer over older RNN models is that it:

  • A Reads text strictly one word at a time
  • B Processes all words in parallel, weighing every word against every other
  • C Never makes mistakes
  • D Uses no training data
Answer: B

Transformers process the whole sequence at once and can relate distant words, solving the context-loss problem of sequential RNNs and enabling parallel GPU training.

MCQ · Q3 · Encoder/Decoder

Which part of a Transformer generates the output text word-by-word?

  • A The Encoder
  • B The Decoder
  • C The Tokenizer
  • D The Optimizer
Answer: B

The Encoder builds an understanding of the input; the Decoder uses that understanding to generate the output one word at a time.

MCQ · Q4 · Q/K/V

In the self-attention formula, which vector represents "what I'm looking for"?

  • A Query (Q)
  • B Key (K)
  • C Value (V)
  • D Softmax
Answer: A

Query = "what do I need?", Key = "what do I offer?", Value = "the actual information". A word's Query is matched against all Keys.

MCQ · Q5 · Self-attention

In the attention mechanism, what does the softmax step produce?

  • A The final output sentence
  • B Attention weights: positive numbers that sum to 1
  • C The word embeddings
  • D The loss value
Answer: B

Softmax converts the scaled relevance scores into attention weights (percentages summing to 1), which then weight the Value vectors.

MCQ · Q6 · Positional encoding

Without Positional Encoding, a Transformer would:

  • A Forget the words entirely
  • B Treat "Dog bites Man" and "Man bites Dog" as identical
  • C Run much slower
  • D Stop hallucinating
Answer: B

Because all words are processed simultaneously, the model has no inherent sense of order. Positional encoding adds a position tag so word order is preserved.

MCQ · Q7 · Masked attention

Why does the Decoder use "Masked" attention?

  • A To make the maths faster
  • B To stop the model "peeking" at future words it has not generated yet
  • C To keep the words private
  • D To remove stop words
Answer: B

When generating word N, the decoder must only see words 1…N−1. Masking hides future positions so generation stays causal.

MCQ · Q8 · Hallucination

A "hallucination" in an LLM means:

  • A The AI has stopped working
  • B The AI confidently states something that is factually false
  • C The AI is dreaming
  • D The AI refuses to answer
Answer: B

An LLM predicts what sounds right, not what is right. With no fact-checking, it can fluently produce false but plausible information.

MCQ · Q9 · Bias

A hiring AI rejects female candidates. The most accurate explanation is:

  • A The AI is evil and intends harm
  • B It learned biased patterns from historical training data (bias ≠ malice)
  • C The AI ran out of memory
  • D The AI was not trained at all
Answer: B

AI bias comes from patterns in historical data, not intent. The data showed mostly men were hired, so the model repeated that pattern.

Numerical · Q10 · Attention score

A word's Query vector is [2, 1] and another word's Key vector is [2, 2]. Compute their attention (dot-product) score.

Answer: 6

Dot product = (2×2) + (1×2) = 4 + 2 = 6. A higher dot product means higher relevance between the two words.
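The same arithmetic can be checked with numpy. The scaled value at the end is an extra illustration, assuming dₖ = 2 because the vectors have two components; the question itself only asks for the raw dot product.

```python
import numpy as np

query = np.array([2, 1])
key = np.array([2, 2])

score = int(query @ key)              # (2*2) + (1*2)
print(score)                          # 6

# Optional: the scaled score used inside the attention formula (d_k = 2 here).
scaled = score / np.sqrt(len(key))
print(round(float(scaled), 3))        # 4.243
```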

Coding · Q11 · Softmax

Write a numpy function softmax(scores) that converts raw attention scores into weights summing to 1. Test on [5, 1, 6].

Solution
Python
import numpy as np

def softmax(scores):
    scores = np.array(scores)
    exp = np.exp(scores - np.max(scores))   # subtract max for stability
    return exp / exp.sum()

weights = softmax([5, 1, 6])
print("Weights:", weights.round(3))
print("Sum    :", weights.sum())
Output
Weights: [0.268 0.005 0.727]
Sum    : 1.0

Softmax converts the scores into attention weights: the third word (score 6) gets the most attention (73%).

Short Answer · Q12 · Self-attention

In one or two sentences, explain how self-attention helps the model resolve the word "it" in the sentence "The trophy doesn't fit in the suitcase because it is too large."

Model answer

When processing "it", self-attention compares its Query vector against the Key vectors of every other word and computes relevance scores. The word "trophy" earns a high attention weight, so the model's representation of "it" is built largely from "trophy", correctly resolving the reference.

Short Answer · Q13 · Responsible AI

Name the four principles of Responsible AI and give a one-line meaning for each.

Model answer

Fairness: the system should not create or reinforce unfair bias. Transparency: we should understand how it makes decisions. Accountability: humans remain responsible for the AI's outcomes. Safety & Privacy: the system must be reliable, secure and minimise harm to people.

🎯 Lecture 10 must-remember: LLM = next-token predictor with billions of parameters. Transformer (2017) processes all words in parallel; Encoder understands, Decoder generates. Self-attention: Attention = softmax(Q·Kᵀ/√dₖ)·V, built from Query/Key/Value. Positional encoding gives word order. Hallucination = confident falsehood; bias ≠ malice. Responsible AI: Fairness, Transparency, Accountability, Safety & Privacy.