⚡ LECTURE 10

LLM Architecture

How ChatGPT actually works. Learn the Transformer architecture, the self-attention mechanism (Q, K, V) that powers it, and the ethics of building responsible AI.

Syllabus topics 35–38 · ⏱ ~28 min read · 13 practice questions

In this lecture

What is a Large Language Model?
The Transformer Architecture
Self-Attention (Q, K, V)
Tokenization & Positional Encoding
Ethics & Responsible AI
Practice Questions

10.1 What is a Large Language Model?

Large Language Model (LLM) — a program trained on massive amounts of text to predict the next word (token) in a sequence. "Large" refers to the billions of parameters (adjustable numbers) it uses to recognise language patterns.

🔑 It feels like magic, but it is not. At its heart, an LLM is a sophisticated next-word prediction engine. It does not "understand" meaning; it recognises statistical patterns from its training text and calculates the most probable next word, then the next, and so on. Your phone's keyboard does a tiny version of this; ChatGPT does it at enormous scale.
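To make "next-word prediction from statistical patterns" concrete, here is a minimal sketch of a bigram predictor that simply counts which word follows which. It is a toy stand-in, not how an LLM is built: the corpus and the function names `train_bigram` and `next_word_probs` are made up for illustration, and a real LLM uses a neural network over tokens rather than a count table.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts[word]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

counts = train_bigram(corpus)
probs = next_word_probs(counts, "the")
best = max(probs, key=probs.get)
print(probs)
print("Most probable next word after 'the':", best)
```

In this corpus "cat" follows "the" twice out of six times, so it is picked as the most probable continuation. The model has no idea what a cat is; it only has counts.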

The problem with old (sequential) models

Early models (RNNs) read text one word at a time, like reading a novel through a straw. In a long sentence — "The cat, which had been sleeping peacefully on the warm windowsill all afternoon, suddenly jumped" — by the time the model reaches "jumped" it has largely forgotten "the cat". This is the problem of context loss.

10.2 The Transformer Architecture

Transformer — the architecture introduced in 2017 ("Attention Is All You Need") that processes the entire sentence at once instead of word-by-word. It can weigh the importance of every word relative to every other word, no matter how far apart.

Aspect	Old: RNN (sequential)	New: Transformer (parallel)
Reading style	One word at a time	All words at once
Analogy	Reading through a narrow straw	Seeing the whole picture instantly
Long-range context	Forgets the start by the end	Keeps the full context in view
Speed	Slow: cannot parallelise	Fast: runs in parallel on GPUs

The Transformer enabled ChatGPT, GPT-4, BERT and virtually every modern LLM.

Encoder and Decoder

Encoder (Understanding) — reads input text and builds a rich, contextual understanding of its meaning. Like a student listening carefully and taking notes.
Decoder (Generating) — takes that understanding and generates new text, one word at a time. Like the same student writing an essay from their notes.

Inside the blocks

Encoder block	Decoder block
Self-Attention — who relates to whom	Masked Attention — same, but cannot "peek" at future words
Add & Norm — stabilise / "reality check"	Cross-Attention — asks the Encoder for help
Feed Forward — process the meaning	Feed Forward + Next-Word Prediction

💡 Tip — why "Masked" attention in the decoder? When generating word #5, the decoder must not see words #6, #7… that it has not written yet. Masked attention hides ("masks") future words so the model only attends to what it has already produced — like a test where future answers are covered.
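The masking idea in the tip can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions: the helper names `causal_mask` and `masked_softmax` are hypothetical, and the scores are random numbers rather than real model outputs.

```python
import numpy as np

def causal_mask(n):
    """Boolean n x n matrix: True where position i may attend to position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives masked-out (False) positions zero weight."""
    masked = np.where(mask, scores, -np.inf)               # hide future positions
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(masked)                                   # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))          # made-up raw attention scores
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

Every entry above the diagonal comes out as 0, so each position spreads its attention only over itself and earlier positions, and each row still sums to 1.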

10.3 The Self-Attention Mechanism

Self-Attention — the mechanism that lets the model give more weight (attention) to the words most relevant for understanding any given word, building a network of relationships between all words.

🧩 The pronoun problem: "The trophy doesn't fit in the suitcase because it is too large." What does "it" refer to, the trophy or the suitcase? Humans instantly know it's the trophy. Self-attention lets the model "attend to" earlier words and assign "trophy" a high attention score when processing "it".

Query, Key, Value

Every word produces three vectors via learned weight matrices:

Vector	Meaning	Analogy
Q — Query	What I'm looking for	"What information do I need?"
K — Key	What I offer	"Here's what I can tell you about"
V — Value	What I actually say	The real information content passed along
d_k	Dimension of the Key vectors	Not a vector itself: √d_k is the scaling factor that keeps scores numerically stable

The attention formula

Attention(Q, K, V) = softmax( Q·K^T / √d_k ) · V

The four steps

Score — take the dot product of a word's Query with every word's Key. A high dot product = high relevance.
Scale — divide the scores by √d_k to keep the numbers stable during training.
Softmax — convert the scaled scores into attention weights (positive, summing to 1) — percentages of attention.
Weighted sum — the new representation of the word is the weighted sum of all Value vectors, using those attention weights.
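The four steps above can be sketched directly in NumPy. This is a minimal single-head sketch, assuming Q, K and V are already given as matrices with one row per word; here they are random placeholders, whereas in a real model they come from learned weight matrices.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the max subtracted for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T                 # 1. Score: every Query against every Key
    scaled = scores / np.sqrt(d_k)   # 2. Scale
    weights = softmax(scaled)        # 3. Softmax: each row sums to 1
    output = weights @ V             # 4. Weighted sum of Value vectors
    return output, weights

rng = np.random.default_rng(42)
n_words, d_k = 3, 4                  # e.g. the three words of "Transformers are amazing"
Q = rng.normal(size=(n_words, d_k))  # random placeholders, not learned values
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))

output, weights = attention(Q, K, V)
print("weights:\n", weights.round(3))
print("output shape:", output.shape)
```

Row i of `weights` says how much word i attends to each word, and row i of `output` is that word's new, context-aware representation.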

🧩 Worked example (from the worksheet): Sentence "Transformers are amazing". For "Transformers", scores against [Transformers, are, amazing] = [5, 1, 6]. Softmax → roughly [26.8%, 0.5%, 72.7%]. So "Transformers" pays about 72.7% attention to "amazing". Final context vector ≈ 0.268·V(Transformers) + 0.005·V(are) + 0.727·V(amazing), a brand-new vector that now "knows" Transformers are amazing.

Why self-attention is revolutionary

Global context — every word can directly attend to every other word, regardless of distance. No information bottleneck.
Parallel processing — all attention computations happen at once → massive GPU speed-ups.
Multi-Head Attention — run several attention mechanisms in parallel to capture different relationship types (syntax, semantics) simultaneously.
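Multi-head attention can be sketched as the same attention computation run on several slices of the vectors at once. The sketch below is illustrative only: the weight matrices `W_q`, `W_k`, `W_v` and the output projection `W_o` are random placeholders (in a real model they are learned), and the final `W_o` projection follows the common formulation from the 2017 paper rather than anything stated in these notes.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, with the max subtracted for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention over X (n_words x d_model) using n_heads parallel heads."""
    n_words, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n_words, d_model) -> (n_heads, n_words, d_head)
        return M.reshape(n_words, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n_words, n_words)
    weights = softmax(scores)                             # one attention map per head
    heads = weights @ V                                   # (n_heads, n_words, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n_words, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 3, 8, 2
X = rng.normal(size=(n_words, d_model))                   # made-up word vectors
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

Each head gets its own 3 x 3 attention map, which is what allows different heads to specialise in different kinds of relationships.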

10.4 Tokenization & Positional Encoding

Tokenization (text → numbers)

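Text is first split into tokens (whole words or pieces of words). Each token is converted to an ID (an index, like a menu item number), and the ID is then used to look up a vector (the embedding, which carries the meaning). The ID itself holds no meaning; it is only the key for finding the right vector.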
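The ID-then-vector lookup can be sketched as follows. Everything here is a toy assumption: the four-entry `vocab`, the `tokenize` helper and the random `embedding_table` are made up for illustration, and real tokenizers split text into subword pieces rather than whole words.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"<unk>": 0, "dog": 1, "bites": 2, "man": 3}

def tokenize(text):
    """Map each lowercase word to its ID, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

rng = np.random.default_rng(0)
embedding_dim = 4
# One row per vocabulary entry. Random here; learned during training in a real model.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

ids = tokenize("Dog bites Man")
vectors = embedding_table[ids]       # the ID is just a row index into the table
print("IDs    :", ids)
print("Vectors:", vectors.shape)
```

The IDs come out as [1, 2, 3], and each one selects a row of the table, giving one 4-dimensional vector per word.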

Positional Encoding

🔑 Why positional encoding is essential: because the Transformer reads all words at the same time, it has no built-in sense of order, so "Dog bites Man" and "Man bites Dog" would look identical. Positional Encoding adds a "position tag" to each word's vector so the model knows the sequence order.
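The "position tag" can be sketched with the sinusoidal encoding from the 2017 Transformer paper. These notes do not say which encoding scheme is meant, so treat that choice as an assumption; the word embeddings below are random placeholders and the helper names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
d_model = 8
emb = {w: rng.normal(size=d_model) for w in ["dog", "bites", "man"]}  # made-up embeddings

def encode(words, use_positions):
    """Stack the word vectors, optionally adding a position tag to each row."""
    x = np.stack([emb[w] for w in words])
    if use_positions:
        x = x + sinusoidal_positions(len(words), d_model)
    return x

a, b = ["dog", "bites", "man"], ["man", "bites", "dog"]
# Reversing the rows of b gives back a exactly if, and only if, order left no trace.
same_without = np.allclose(encode(a, False), encode(b, False)[::-1])
same_with = np.allclose(encode(a, True), encode(b, True)[::-1])
print("Indistinguishable without positions:", same_without)
print("Indistinguishable with positions   :", same_with)
```

Without the position tags the two sentences are the same vectors in shuffled order; with them, "dog" at position 0 and "dog" at position 2 become different vectors.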

10.5 Ethics & Responsible AI

LLMs are trained on a vast snapshot of the public internet — so they learn human knowledge and human biases, errors and misinformation. The model cannot tell a peer-reviewed paper from a conspiracy blog; it treats all text as equally valid.

Failure Mode 1 — Bias

AI Bias — when a model's outputs create or reinforce unfair prejudice. Bias ≠ malice — the AI is not evil; it simply repeats patterns from historical training data.

Example: a résumé-screening AI learned to reject female candidates because the people hired in its historical training data were mostly men. If training data shows most CEOs are men, the model may complete "The CEO finished his…" with gendered language.

Failure Mode 2 — Hallucinations

Hallucination — when an LLM confidently states false information. LLMs are designed for fluency, not truth — they assemble the most probable sequence of words, with no built-in fact-checking.

Real case: a lawyer used ChatGPT to research legal cases; it confidently cited six cases — all completely fabricated, with made-up names and judges — and he submitted them to court before discovering they didn't exist. Never trust an LLM blindly for medical, legal or financial advice. LLMs are probabilistic, not deterministic.

Preventing misinformation

Biased training data → audit datasets for representation and balance.
Feature selection → carefully evaluate which attributes are truly relevant and fair.
Evaluation metrics → test performance across different demographic groups, not just overall accuracy.
Deployment context → continuous monitoring and human oversight of AI decisions.
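The "test across demographic groups" item in the list above can be sketched as a per-group accuracy check. The labels, predictions and group names below are made up purely for illustration, and `accuracy_by_group` is a hypothetical helper, not a library function.

```python
import numpy as np

# Made-up labels and predictions for illustration only (1 = shortlisted, 0 = rejected).
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
y_true = np.array([1, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy plus accuracy computed separately for each group."""
    correct = (y_true == y_pred)
    per_group = {str(g): float(correct[groups == g].mean()) for g in np.unique(groups)}
    return float(correct.mean()), per_group

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print("Overall accuracy  :", overall)
print("Per-group accuracy:", per_group)
```

Here the overall accuracy of 0.75 hides the fact that the model is right every time for group A but only half the time for group B, which is exactly the gap a single headline number would miss.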

The four principles of Responsible AI

Principle	Meaning	Analogy
Fairness	The system should not create or reinforce unfair bias	A fair referee applying the same rules to everyone
Transparency	We should understand how the system makes decisions	Showing your work on a maths problem
Accountability	People are responsible for AI outcomes	A captain responsible for their ship
Safety & Privacy	Systems must be reliable, secure and minimise harm	Seatbelts in cars

💡 Tip: reducing hallucinations in practice. Instruct the model to admit ignorance: "Answer only using the provided text", "If you cannot find the answer, say 'I don't know'." Setting temperature to 0 makes outputs more deterministic and repeatable, which removes one source of random errors, but it does not by itself make them factual. (RAG, Lecture 16, grounds answers in real documents.)
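The effect of temperature mentioned in the tip can be sketched as a softmax over scores divided by the temperature. This is an illustrative sketch: the three logits are made up, and `softmax_with_temperature` is a hypothetical helper. A temperature of exactly 0 would divide by zero here; in practice it is usually treated as "always pick the top-scoring token".

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be > 0."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                 # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]                   # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    print(t, softmax_with_temperature(logits, t).round(3))
```

As the temperature drops, almost all of the probability piles onto the top-scoring token, so sampling becomes nearly deterministic. The ranking of the tokens never changes, which is why a low temperature cannot fix a model whose top choice is wrong.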

Practice Questions

Self-attention and responsible AI are core exam topics.

MCQ · Q1 · LLM basics

At its core, an LLM works mostly by:

A Having feelings and consciousness
B Predicting the most likely next word/token using statistical patterns
C Searching the entire internet for each question
D Storing every sentence it has ever read word-for-word

Answer: B

An LLM is a next-word prediction engine. It calculates probabilities over possible next tokens from patterns learned in training; it does not understand meaning or browse the internet live.

MCQ · Q2 · Transformer

The key innovation of the Transformer over older RNN models is that it:

A Reads text strictly one word at a time
B Processes all words in parallel, weighing every word against every other
C Never makes mistakes
D Uses no training data

Answer: B

Transformers process the whole sequence at once and can relate distant words, solving the context-loss problem of sequential RNNs and enabling parallel GPU training.

MCQ · Q3 · Encoder/Decoder

Which part of a Transformer generates the output text word-by-word?

A The Encoder
B The Decoder
C The Tokenizer
D The Optimizer

Answer: B

The Encoder builds an understanding of the input; the Decoder uses that understanding to generate the output one word at a time.

MCQ · Q4 · Q/K/V

In the self-attention formula, which vector represents "what I'm looking for"?

A Query (Q)
B Key (K)
C Value (V)
D Softmax

Answer: A

Query = "what do I need?", Key = "what do I offer?", Value = "the actual information". A word's Query is matched against all Keys.

MCQ · Q5 · Self-attention

In the attention mechanism, what does the softmax step produce?

A The final output sentence
B Attention weights — positive numbers that sum to 1
C The word embeddings
D The loss value

Answer: B

Softmax converts the scaled relevance scores into attention weights (percentages summing to 1), which then weight the Value vectors.

MCQ · Q6 · Positional encoding

Without Positional Encoding, a Transformer would:

A Forget the words entirely
B Treat "Dog bites Man" and "Man bites Dog" as identical
C Run much slower
D Stop hallucinating

Answer: B

Because all words are processed simultaneously, the model has no inherent sense of order. Positional encoding adds a position tag so word order is preserved.

MCQ · Q7 · Masked attention

Why does the Decoder use "Masked" attention?

A To make the maths faster
B To stop the model "peeking" at future words it has not generated yet
C To keep the words private
D To remove stop words

Answer: B

When generating word N, the decoder must only see words 1…N−1. Masking hides future positions so generation stays causal.

MCQ · Q8 · Hallucination

A "hallucination" in an LLM means:

A The AI has stopped working
B The AI confidently states something that is factually false
C The AI is dreaming
D The AI refuses to answer

Answer: B

An LLM predicts what sounds right, not what is right. With no fact-checking, it can fluently produce false but plausible information.

MCQ · Q9 · Bias

A hiring AI rejects female candidates. The most accurate explanation is:

A The AI is evil and intends harm
B It learned biased patterns from historical training data (bias ≠ malice)
C The AI ran out of memory
D The AI was not trained at all

Answer: B

AI bias comes from patterns in historical data, not intent. The data showed mostly men were hired, so the model repeated that pattern.

Numerical · Q10 · Attention score

A word's Query vector is [2, 1] and another word's Key vector is [2, 2]. Compute their attention (dot-product) score.

Answer: 6

Dot product = (2×2) + (1×2) = 4 + 2 = 6. A higher dot product means higher relevance between the two words.
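The arithmetic can be checked with a short NumPy sketch. The scaled value is an extra step not asked for in the question; it assumes d_k = 2, the length of the Key vector given here.

```python
import numpy as np

q = np.array([2, 1])   # Query vector from the question
k = np.array([2, 2])   # Key vector from the question

score = int(np.dot(q, k))             # (2*2) + (1*2)
scaled = score / np.sqrt(len(k))      # divide by sqrt(d_k), with d_k = 2 here
print("Raw score   :", score)
print("Scaled score:", round(float(scaled), 3))
```

The raw score is 6, matching the answer above; after dividing by √2 it becomes roughly 4.243, which is the value that would be passed to softmax in the full attention formula.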

Coding · Q11 · Softmax

Write a numpy function softmax(scores) that converts raw attention scores into weights summing to 1. Test on [5, 1, 6].

Solution

Python

import numpy as np

def softmax(scores):
    scores = np.array(scores)
    exp = np.exp(scores - np.max(scores))   # subtract max for stability
    return exp / exp.sum()

weights = softmax([5, 1, 6])
print("Weights:", weights.round(3))
print("Sum    :", weights.sum())

Output

Weights: [0.268 0.005 0.727]
Sum    : 1.0

Softmax converts the scores into attention weights — the third word (score 6) gets the most attention (73%).

Short Answer · Q12 · Self-attention

In one or two sentences, explain how self-attention helps the model resolve the word "it" in the sentence "The trophy doesn't fit in the suitcase because it is too large."

Model answer

When processing "it", self-attention compares its Query vector against the Key vectors of every other word and computes relevance scores. The word "trophy" earns a high attention weight, so the model's representation of "it" is built largely from "trophy" — correctly resolving the reference.

Short Answer · Q13 · Responsible AI

Name the four principles of Responsible AI and give a one-line meaning for each.

Model answer

Fairness — the system should not create or reinforce unfair bias. Transparency — we should understand how it makes decisions. Accountability — humans remain responsible for the AI's outcomes. Safety & Privacy — the system must be reliable, secure and minimise harm to people.

🎯 Lecture 10 must-remember: LLM = next-token predictor with billions of parameters. Transformer (2017) processes all words in parallel; Encoder understands, Decoder generates. Self-attention: Attention = softmax(Q·Kᵀ/√dₖ)·V, built from Query, Key and Value vectors. Positional encoding gives word order. Hallucination = confident falsehood; bias ≠ malice. Responsible AI: Fairness, Transparency, Accountability, Safety & Privacy.
