LLM Architecture
How ChatGPT actually works. Learn the Transformer architecture, the self-attention mechanism (Q, K, V) that powers it, and the ethics of building responsible AI.
10.1 What is a Large Language Model?
The problem with old (sequential) models
Early models (RNNs) read text one word at a time, like reading a novel through a straw. In a long sentence such as "The cat, which had been sleeping peacefully on the warm windowsill all afternoon, suddenly jumped", by the time the model reaches "jumped" it has largely forgotten "the cat". This is the problem of context loss.
10.2 The Transformer Architecture
| Aspect | Old: RNN (sequential) | New: Transformer (parallel) |
|---|---|---|
| Reading style | One word at a time | All words at once |
| Analogy | Reading through a narrow straw | Seeing the whole picture instantly |
| Problem | Forgets the start by the end | Full context understood |
| Speed | Slow: cannot parallelise | Fast: parallel on GPUs |
The Transformer enabled ChatGPT, GPT-4, BERT and virtually every modern LLM.
Encoder and Decoder
- Encoder (Understanding): reads input text and builds a rich, contextual understanding of its meaning. Like a student listening carefully and taking notes.
- Decoder (Generating): takes that understanding and generates new text, one word at a time. Like the same student writing an essay from their notes.
Inside the blocks
| Encoder block | Decoder block |
|---|---|
| Self-Attention: who relates to whom | Masked Attention: same, but cannot "peek" at future words |
| Add & Norm: stabilise / "reality check" | Cross-Attention: asks the Encoder for help |
| Feed Forward: process the meaning | Feed Forward + Next-Word Prediction |
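The "cannot peek" rule in Masked Attention can be illustrated with a small NumPy sketch. The function names `causal_mask` and `masked_softmax` are hypothetical, chosen for this example only, and the all-zero scores are placeholder data.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean matrix: entry [i, j] is True when position i may attend to position j."""
    # Lower-triangular: each position sees itself and earlier positions only.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out (False) positions."""
    masked = np.where(mask, scores, -np.inf)               # hide future positions
    masked = masked - masked.max(axis=-1, keepdims=True)   # subtract max for stability
    exp = np.exp(masked)                                   # exp(-inf) becomes 0
    return exp / exp.sum(axis=-1, keepdims=True)

# Placeholder scores: every pair of positions is equally relevant.
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

With equal scores, row 0 puts all of its weight on position 0, row 1 splits its weight between positions 0 and 1, and so on. Every entry above the diagonal is exactly zero, so no position draws on a later one.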
10.3 The Self-Attention Mechanism
Query, Key, Value
Every word produces three vectors via learned weight matrices:
| Vector | Meaning | Analogy |
|---|---|---|
| Q (Query) | What I'm looking for | "What information do I need?" |
| K (Key) | What I offer | "Here's what I can tell you about" |
| V (Value) | What I actually say | The real information content passed along |
| d_k | Dimension of the Key vectors | A scaling factor for numerical stability |
The attention formula
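The formula itself did not survive in these notes. Assuming it is the standard scaled dot-product attention, which matches the four steps listed below, it would read:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Here $Q K^{\top}$ gives the scores, division by $\sqrt{d_k}$ is the scaling, the softmax is applied row by row, and multiplying by $V$ forms the weighted sum.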
The four steps
- Score: take the dot product of a word's Query with every word's Key. A high dot product = high relevance.
- Scale: divide the scores by √d_k to keep the numbers stable during training.
- Softmax: convert the scaled scores into attention weights (positive, summing to 1), i.e. percentages of attention.
- Weighted sum: the new representation of the word is the weighted sum of all Value vectors, using those attention weights.
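The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the random `Q`, `K` and `V` matrices stand in for the vectors a trained model would compute, and the function name `self_attention` is an assumption of this example.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention for one sequence (rows = words)."""
    d_k = K.shape[-1]
    scores = Q @ K.T                  # 1. Score: each Query against every Key
    scaled = scores / np.sqrt(d_k)    # 2. Scale: divide by sqrt(d_k)
    weights = softmax(scaled)         # 3. Softmax: each row sums to 1
    output = weights @ V              # 4. Weighted sum of the Value vectors
    return output, weights

# Toy data: 3 words, 4-dimensional vectors (random placeholders).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = self_attention(Q, K, V)
print(weights.round(3))   # one row of attention weights per word
print(output.shape)       # (3, 4): one new representation per word
```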
Why self-attention is revolutionary
- Global context: every word can directly attend to every other word, regardless of distance. No information bottleneck.
- Parallel processing: all attention computations happen at once, giving massive GPU speed-ups.
- Multi-Head Attention: run several attention mechanisms in parallel to capture different relationship types (syntax, semantics) simultaneously.
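Multi-Head Attention can be sketched by splitting the model dimension into several smaller heads, running attention in each, and concatenating the results. The weight matrices below are random placeholders (a real model learns them), and names such as `multi_head_attention` and `W_o` are assumptions of this sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Run num_heads attention heads in parallel over the same input X."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads   # assumes d_model is divisible by num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores)                             # one attention map per head
    heads = weights @ V                                   # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5): a separate attention map for each head
```

Each head sees a different slice of the projected vectors, which is what lets the heads specialise in different relationship types.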
10.4 Tokenization & Positional Encoding
Tokenization (text → numbers)
Each token (a word or a piece of a word) is converted to an ID (an index, like a menu item number), then to a vector (the embedding that carries its meaning). Computers need the simple ID to look up the complex math.
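A toy version of this lookup is sketched below. It assumes a whitespace tokenizer and a random embedding table purely for illustration; real LLMs use learned subword tokenizers and learned embeddings.

```python
import numpy as np

sentence = "the cat sat on the mat"

# Step 1: split text into tokens (toy whitespace tokenizer).
tokens = sentence.split()

# Step 2: map each distinct token to an integer ID.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

# Step 3: look up each ID in an embedding table (random placeholder values).
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))
vectors = embedding_table[token_ids]

print(vocab)          # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(token_ids)      # [4, 0, 3, 2, 4, 1]
print(vectors.shape)  # (6, 4): one vector per token
```

Both occurrences of "the" share the same ID and therefore the same vector, so the embedding on its own carries no information about position.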
Positional Encoding
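Because a Transformer processes all tokens at once, it needs position information added to each embedding. These notes do not say which scheme is meant; the sketch below assumes the sinusoidal encoding from the original Transformer paper, and the function name is hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of position tags; d_model must be even."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angles = positions / (10000 ** (even_dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)        # (6, 8)
print(pe[0].round(2))  # position 0: sine terms are 0, cosine terms are 1

# In use, the encoding is added to the token embeddings:
# inputs = embeddings + pe
```

Every position gets a distinct vector, so two identical words at different positions no longer look the same to the model.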
10.5 Ethics & Responsible AI
LLMs are trained on a vast snapshot of the public internet, so they learn human knowledge along with human biases, errors and misinformation. The model cannot tell a peer-reviewed paper from a conspiracy blog; it treats all text as equally valid.
Failure Mode 1: Bias
Example: a résumé-screening AI learned to reject female candidates because historical hiring data was mostly men. If training data shows most CEOs are men, the model may complete "The CEO finished his…" with gendered language.
Failure Mode 2: Hallucinations
Real case: a lawyer used ChatGPT to research legal cases; it confidently cited six cases, all completely fabricated, with made-up names and judges, and he submitted them to court before discovering they didn't exist. Never trust an LLM blindly for medical, legal or financial advice. LLMs are probabilistic, not deterministic.
Preventing misinformation
- Biased training data: audit datasets for representation and balance.
- Feature selection: carefully evaluate which attributes are truly relevant and fair.
- Evaluation metrics: test performance across different demographic groups, not just overall accuracy.
- Deployment context: continuous monitoring and human oversight of AI decisions.
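The point about evaluation metrics can be made concrete with a small sketch. The labels, predictions and group names below are invented for illustration only, and `accuracy_by_group` is a hypothetical helper.

```python
import numpy as np

# Made-up evaluation data: true labels, model predictions, and a group tag per example.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

def accuracy_by_group(y_true, y_pred, group):
    """Accuracy computed separately for each group label."""
    return {
        str(g): float((y_true[group == g] == y_pred[group == g]).mean())
        for g in np.unique(group)
    }

overall = float((y_true == y_pred).mean())
per_group = accuracy_by_group(y_true, y_pred, group)

print("Overall:", overall)      # 0.625
print("Per group:", per_group)  # {'A': 0.75, 'B': 0.5}
```

The overall figure hides the fact that the model is noticeably less accurate for group B, which is the kind of gap a per-group check is meant to surface.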
The four principles of Responsible AI
| Principle | Meaning | Analogy |
|---|---|---|
| Fairness | The system should not create or reinforce unfair bias | A fair referee applying the same rules to everyone |
| Transparency | We should understand how the system makes decisions | Showing your work on a maths problem |
| Accountability | People are responsible for AI outcomes | A captain responsible for their ship |
| Safety & Privacy | Systems must be reliable, secure and minimise harm | Seatbelts in cars |
Setting temperature to 0 also makes outputs more deterministic and repeatable, though it does not guarantee factual accuracy. (RAG, Lecture 16, grounds answers in real documents.)
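The effect of temperature on next-token probabilities can be sketched as follows. The logits are made-up numbers, and treating temperature 0 as "always pick the top token" is the usual convention, assumed here rather than taken from any particular API.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Greedy limit: all probability goes to the highest-scoring token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]   # placeholder scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
```

At temperature 0 the same token is chosen every time, which is why outputs become repeatable. Higher temperatures spread probability onto less likely tokens.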
Self-attention and responsible AI are core exam topics.
At its core, an LLM works mostly by:
An LLM is a next-word prediction engine. It calculates probabilities over possible next tokens from patterns learned in training; it does not understand or browse live.
The key innovation of the Transformer over older RNN models is that it:
Transformers process the whole sequence at once and can relate distant words, solving the context-loss problem of sequential RNNs and enabling parallel GPU training.
Which part of a Transformer generates the output text word-by-word?
The Encoder builds an understanding of the input; the Decoder uses that understanding to generate the output one word at a time.
In the self-attention formula, which vector represents "what I'm looking for"?
Query = "what do I need?", Key = "what do I offer?", Value = "the actual information". A word's Query is matched against all Keys.
In the attention mechanism, what does the softmax step produce?
Softmax converts the scaled relevance scores into attention weights (percentages summing to 1), which then weight the Value vectors.
Without Positional Encoding, a Transformer would:
Because all words are processed simultaneously, the model has no inherent sense of order. Positional encoding adds a position tag so word order is preserved.
Why does the Decoder use "Masked" attention?
When generating word N, the decoder must only see words 1…N−1. Masking hides future positions so generation stays causal.
A "hallucination" in an LLM means:
An LLM predicts what sounds right, not what is right. With no fact-checking, it can fluently produce false but plausible information.
A hiring AI rejects female candidates. The most accurate explanation is:
AI bias comes from patterns in historical data, not intent. The data showed mostly men were hired, so the model repeated that pattern.
A word's Query vector is [2, 1] and another word's Key vector is [2, 2]. Compute their attention (dot-product) score.
Dot product = (2×2) + (1×2) = 4 + 2 = 6. A higher dot product means higher relevance between the two words.
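The same arithmetic can be checked in NumPy. The scaling line goes one step beyond the question and assumes d_k equals the length of the Key vector (2 here).

```python
import numpy as np

query = np.array([2, 1])
key = np.array([2, 2])

score = int(query @ key)                     # (2*2) + (1*2)
scaled = float(score / np.sqrt(len(key)))    # divide by sqrt(d_k), with d_k = 2

print(score)             # 6
print(round(scaled, 3))  # 4.243
```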
Write a numpy function softmax(scores) that converts raw attention scores into weights summing to 1. Test on [5, 1, 6].
import numpy as np
def softmax(scores):
    scores = np.array(scores)
    exp = np.exp(scores - np.max(scores))  # subtract max for stability
    return exp / exp.sum()
weights = softmax([5, 1, 6])
print("Weights:", weights.round(3))
print("Sum :", weights.sum())
Softmax converts the scores into attention weights: the third word (score 6) gets the most attention (73%).
In one or two sentences, explain how self-attention helps the model resolve the word "it" in the sentence "The trophy doesn't fit in the suitcase because it is too large."
When processing "it", self-attention compares its Query vector against the Key vectors of every other word and computes relevance scores. The word "trophy" earns a high attention weight, so the model's representation of "it" is built largely from "trophy", correctly resolving the reference.
Name the four principles of Responsible AI and give a one-line meaning for each.
Fairness: the system should not create or reinforce unfair bias. Transparency: we should understand how it makes decisions. Accountability: humans remain responsible for the AI's outcomes. Safety & Privacy: the system must be reliable, secure and minimise harm to people.