GenAI Exam Prep
⚡ LECTURE 9

Natural Language Processing

Bridging human language and machine understanding. Learn how raw, messy text is cleaned, tokenized, turned into numbers, and used to model sequences — the foundation of every LLM.

Syllabus topics 30–34 ⏱ ~27 min read 12 practice questions

9.1 Why NLP needs numbers

Natural Language Processing (NLP) — a branch of AI that enables computers to understand, interpret and generate human language. It combines computational linguistics with machine learning.

Computers only understand numbers. Human language — words, sentences, paragraphs — is meaningless to a machine until it is converted into numerical vectors. The whole NLP pipeline is about that conversion: raw text → clean text → tokens → numbers → vectors that preserve meaning.

Applications: virtual assistants (Siri, Alexa), machine translation, sentiment analysis, chatbots, spam filtering, autocomplete.

🔑 The NLP pipeline (memorise the order): Clean text → Tokenize → Remove stop words → Stem/Lemmatize → Vectorize (Bag of Words / Embeddings) → Pad → feed to model.

9.2 Text Preprocessing

To a computer, "Apple", "apple" and "apple!" are three different words. Preprocessing normalises text so the model is not confused by trivial differences.

Lowercasing

Converts all text to lowercase ("HELLO" → "hello"). Reduces vocabulary size and prevents the same word being treated as different tokens. Caution: case can matter sometimes ("Apple" the company vs "apple" the fruit).

Punctuation removal

Removes symbols like ! , . ? #. Reduces noise and lets the model focus on core words. Caution: punctuation occasionally carries meaning.

Stop word removal

Stop words — extremely common words ("the", "is", "at", "were") that provide grammatical structure but little meaning. Removing them lets the model focus on content words (nouns, verbs, adjectives).
⚠️ The "not" trap: Blindly removing stop words can flip meaning: "The movie was NOT good" → "movie good" (meaning reversed!). Standard lists, including NLTK's English list, contain "not", so this happens by default. Rule: simple models (Naive Bayes, Logistic Regression on IoT/low-memory devices) → remove stop words for speed. Sophisticated models (RNNs, Transformers) → keep stop words, because these models can use words like "not" in context.
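
The trap can be reproduced in a few lines. The sketch below uses a small hand-written stop-word set (hypothetical, not NLTK's list) and a `keep_negations` switch, which is an illustrative name rather than part of any library.

```python
# Hypothetical, hand-written stop-word list for illustration (not NLTK's list).
STOP_WORDS = {"the", "is", "at", "was", "were", "a", "an", "not", "no"}
# Negation words this sketch chooses to protect (an assumption).
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, keep_negations=True):
    """Drop stop words, optionally keeping negation words."""
    protected = NEGATIONS if keep_negations else set()
    return [t for t in tokens if t not in STOP_WORDS or t in protected]

tokens = "the movie was not good".split()
blind = remove_stop_words(tokens, keep_negations=False)
careful = remove_stop_words(tokens, keep_negations=True)

print(blind)    # ['movie', 'good']        -> sentiment flipped
print(careful)  # ['movie', 'not', 'good'] -> negation preserved
```

Protecting a short list of negations is one possible compromise for simple models: most of the vocabulary reduction is kept while the sentiment-carrying word survives.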

Stemming vs Lemmatization

We want "waiting", "waited", "waits" to map to one root so the model recognises them as related.

Aspect | Stemming | Lemmatization
Method | Crude chopping of word endings | Dictionary look-up to the real root word
Speed | Fast | Slower
Accuracy | Lower (may produce non-words) | Higher (always produces real words)
Example "studies" | "studi" (not a real word) | "study" (correct)
Example "deliveries" | "deliveri" | "delivery"

Python · cleaning text
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time setup if the NLTK data is missing:
#   import nltk; nltk.download('stopwords'); nltk.download('wordnet')

text = "Ugh... The deliveries were DELAYED!! I hate waiting"

# 1. Lowercase
text = text.lower()
# 2. Remove punctuation (map every punctuation character to nothing)
text = text.translate(str.maketrans('', '', string.punctuation))
# 3. Remove stop words (build the set once instead of reloading the list per word)
stop_words = set(stopwords.words('english'))
tokens = [w for w in text.split() if w not in stop_words]
print("Cleaned tokens:", tokens)

# 4. Stem vs Lemmatize
print("studies ->", PorterStemmer().stem("studies"),
      "|", WordNetLemmatizer().lemmatize("studies"))
Output
Cleaned tokens: ['ugh', 'deliveries', 'delayed', 'hate', 'waiting']
studies -> studi | study

9.3 Tokenization

Tokenization — breaking text into smaller units called tokens. Tokens can be words, sub-words, or characters. This transforms continuous text into discrete elements a model can process.
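
As a rough illustration of the three granularities named above, the sketch below tokenizes one sentence at word, character, and sub-word level. The regular expression and the fixed 4-character chunking are assumptions made for this example; real sub-word tokenizers learn their pieces from data rather than cutting at fixed positions.

```python
import re

text = "Tokenization isn't scary"

# Word level: runs of letters, digits and apostrophes (one simple rule among many).
word_tokens = re.findall(r"[A-Za-z0-9']+", text)

# Character level: every character, including spaces, becomes a token.
char_tokens = list(text)

# Sub-word level, crudely approximated by fixed 4-character chunks per word.
# This is only a stand-in for a learned sub-word vocabulary.
def chunk(word, size=4):
    return [word[i:i + size] for i in range(0, len(word), size)]

subword_tokens = [piece for w in word_tokens for piece in chunk(w)]

print(word_tokens)     # ['Tokenization', "isn't", 'scary']
print(len(char_tokens))  # 24
print(subword_tokens)  # ['Toke', 'niza', 'tion', "isn'", 't', 'scar', 'y']
```

Smaller units give a smaller vocabulary but longer sequences, which is the trade-off behind choosing a tokenization level.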

Two strategies

💡 Tip — integer encoding with Keras Tokenizer After tokenizing, each unique word gets an integer index. Tokenizer.fit_on_texts() builds the vocabulary; texts_to_sequences() converts sentences to integer lists; word_index is the word→integer dictionary. An oov_token handles Out-Of-Vocabulary (unseen) words.
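
To see what the integer-encoding step does without installing Keras, here is a simplified pure-Python sketch. `fit_vocab` and `to_sequences` are hypothetical helper names for this example only; they imitate the idea of `fit_on_texts()` and `texts_to_sequences()` but skip details such as punctuation filtering.

```python
from collections import Counter

def fit_vocab(texts, oov_token="<OOV>"):
    """Build a word -> integer dictionary; more frequent words get lower ids."""
    counts = Counter(word for t in texts for word in t.lower().split())
    word_index = {oov_token: 1}          # id 0 is left free for padding
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its id; unseen words map to the OOV id."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.lower().split()] for t in texts]

word_index = fit_vocab(["win free money", "win huge money"])
print(word_index)   # {'<OOV>': 1, 'win': 2, 'money': 3, 'free': 4, 'huge': 5}

# "pizza" was never seen during fitting, so it becomes the OOV id 1.
print(to_sequences(["win free pizza"], word_index))   # [[2, 4, 1]]
```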

9.4 Word Embeddings

Tokens are still text. Models need numbers. The four approaches below turn words into numbers, in rough order of increasing sophistication.

Naive approach — number the words

Apple=1, Banana=2, Carrot=3… Fails, because it invents a fake ordering (Carrot > Apple?) the same way label-encoding nominal data does.

One-Hot Encoding

Each word becomes a binary vector the size of the whole vocabulary — one position is 1, the rest 0. Problems: (1) huge & sparse — a 50,000-word vocabulary gives 50,000-length vectors that are almost all zeros; (2) "dumb" — it treats "happy" and "joy" as completely unrelated.
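
Both problems show up even with a tiny vocabulary. The sketch below assumes a hypothetical four-word vocabulary and uses a dot product as a simple similarity measure.

```python
vocab = ["apple", "happy", "joy", "sad"]   # toy vocabulary (hypothetical)

def one_hot(word):
    """Vector with a 1 at the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

happy, joy = one_hot("happy"), one_hot("joy")

print(happy)             # [0, 1, 0, 0]  -> length equals vocabulary size
print(dot(happy, joy))   # 0  -> no overlap, so no notion of similarity
print(dot(happy, happy)) # 1  -> a word only ever matches itself
```

With 50,000 words each vector would have 49,999 zeros, and the dot product between any two different words would still be 0.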

Bag of Words (BoW)

Represents a sentence by word counts over the vocabulary. "Good movie" and "Movie is good" get similar vectors because they share words. Still sparse and ignores word order and meaning.
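
A minimal sketch of the idea, using a hand-picked vocabulary (an assumption for this example), shows both the benefit and the weakness: shared words give similar vectors, but reordering the words changes nothing.

```python
from collections import Counter

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

vocab = ["bites", "dog", "good", "is", "man", "movie"]

a = bag_of_words("Good movie", vocab)
b = bag_of_words("Movie is good", vocab)
c = bag_of_words("Dog bites man", vocab)
d = bag_of_words("Man bites dog", vocab)

print(a)       # [0, 0, 1, 0, 0, 1]
print(b)       # [0, 0, 1, 1, 0, 1]  -> differs from a in one position only
print(c == d)  # True -> word order is lost
```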

Word Embeddings — the modern solution

Word Embedding — a dense, low-dimensional vector (typically 50–300 numbers) that captures a word's meaning. Words with similar meanings get similar vectors.

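Key advantages over one-hot: (1) compact and dense: 50–300 numbers per word instead of one position per vocabulary word; (2) semantic: words with similar meanings get similar vectors, so "happy" and "joy" end up close together.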

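🧩 The "DNA" of words: King → [0.99, 0.99, 0.05] (royal, male, not-food). Queen → [0.99, 0.05, 0.05] (royal, female, not-food). Apple → [0.01, 0.10, 0.95] (not-royal, neutral, food). King and Queen come out as related: their vectors agree on the royal and food dimensions and differ only on gender, while Apple differs from both on the royal and food dimensions. These values are illustrative; the dimensions of a real learned embedding do not carry neat labels like these.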
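
One common way to put a number on "related" is cosine similarity, which compares the direction of two vectors (1 means same direction, 0 means unrelated). The sketch below applies it to the three toy vectors above; the choice of cosine similarity is an assumption of this example, not something the lecture prescribes.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king = [0.99, 0.99, 0.05]
queen = [0.99, 0.05, 0.05]
apple = [0.01, 0.10, 0.95]

sim_kq = cosine(king, queen)
sim_ka = cosine(king, apple)

print(round(sim_kq, 2))  # 0.74 -> fairly similar
print(round(sim_ka, 2))  # 0.12 -> nearly unrelated
```

King and Queen score far higher than King and Apple, which is the property one-hot vectors cannot express.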
Python · Bag of Words vs Embeddings
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# Bag of Words: one column per vocabulary word, in alphabetical order
# (free, huge, money, win)
corpus = ["Win free money", "Win huge money"]
bow = CountVectorizer().fit_transform(corpus)
print("BoW matrix:\n", bow.toarray())

# Modern: pre-trained sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
vecs = model.encode(["King", "Queen", "Apple"])
print("Embedding length:", len(vecs[0]))   # 384 dense numbers
Output
BoW matrix:
 [[1 0 1 1]
 [0 1 1 1]]
Embedding length: 384

Padding

Padding — neural networks need fixed-length inputs, but sentences vary in length. Padding adds zeros to shorter sequences so all are the same length.

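Example: [90, 400, 22, 10, 5] padded to length 9 (post) → [90, 400, 22, 10, 5, 0, 0, 0, 0]. The zeros act like "empty desks": placeholders that carry no word and that the model can be set up to ignore (for example with masking). Pre-padding puts the zeros at the front instead, and it is the default in Keras pad_sequences.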
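
Padding itself is only list arithmetic. The helper below is a hypothetical pure-Python sketch, not the Keras function; in particular it cuts over-long sequences at the end, which is a choice made for this example.

```python
def pad(seq, maxlen, padding="post", value=0):
    """Pad a list of ints to maxlen; longer sequences are cut at the end."""
    seq = seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq

post = pad([90, 400, 22, 10, 5], 9)
pre = pad([90, 400, 22, 10, 5], 9, padding="pre")

print(post)  # [90, 400, 22, 10, 5, 0, 0, 0, 0]
print(pre)   # [0, 0, 0, 0, 90, 400, 22, 10, 5]
```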

Python · tokenize + pad with Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love AI", "The deliveries were delayed", "Win free money"]

# Index 0 is reserved for padding and index 1 for the OOV token,
# so real words start at 2.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)

# With no maxlen given, sequences are padded to the longest one (4 words here).
padded = pad_sequences(seqs, padding='post')
print(padded)
Output
[[ 2  3  4  0]
 [ 5  6  7  8]
 [ 9 10 11  0]]

9.5 Sequential Modelling & Language Foundations

🔑 Language is sequential Word order matters enormously: "Dog bites man" ≠ "Man bites dog". Sequential models capture this temporal dependency — they learn the patterns of how words follow one another.

Next-word prediction

The core task behind text generation and autocomplete: given a sequence of words, predict the most likely next word. Given "The cat sat on the", a good model predicts "mat" or "floor".
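
Under the hood, "most likely" usually means the model gives every vocabulary word a score and a softmax turns those scores into probabilities. The sketch below uses made-up scores for three candidate words; the numbers are hypothetical and do not come from a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["mat", "floor", "banana"]
scores = [2.0, 1.5, -1.0]   # hypothetical scores for "The cat sat on the ..."

probs = softmax(scores)
best = candidates[probs.index(max(probs))]

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.2f}")
print("prediction:", best)   # prediction: mat
```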

The next-word-prediction workflow

  1. Tokenize & integer-encode the text.
  2. Generate N-gram sequences (progressively longer chunks).
  3. Pad the sequences to equal length.
  4. Split into features X (the context) and target y (the next word).
  5. Apply an Embedding layer.
  6. Process through an RNN.
  7. A Dense output layer with Softmax picks the next word.
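
Steps 2 to 4 of the workflow above can be sketched in plain Python. The integer ids are hypothetical, and pre-padding is chosen here so that the target word always sits in the last position; that is one common arrangement, assumed for this example.

```python
def ngram_sequences(token_ids):
    """[a, b, c, d] -> [[a, b], [a, b, c], [a, b, c, d]]"""
    return [token_ids[:i] for i in range(2, len(token_ids) + 1)]

def pre_pad(seq, maxlen):
    """Add zeros at the front so every sequence has length maxlen."""
    return [0] * (maxlen - len(seq)) + seq

line = [4, 7, 2, 9]   # hypothetical integer-encoded sentence

# Step 2: progressively longer chunks
seqs = ngram_sequences(line)
# Step 3: pad to equal length
maxlen = max(len(s) for s in seqs)
padded = [pre_pad(s, maxlen) for s in seqs]
# Step 4: everything but the last id is the context, the last id is the target
X = [p[:-1] for p in padded]
y = [p[-1] for p in padded]

print(padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
print(X)       # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(y)       # [7, 2, 9]
```

One sentence of four words therefore yields three training examples, each asking the model to predict the word that follows a given context.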

Text generation

Once trained, the model generates text iteratively: start with a seed ("The Sky"), predict the next word ("is"), append it ("The Sky is"), and repeat. Modern LLMs like GPT (Lecture 10) follow the same loop, but they predict sub-word tokens rather than whole words and use Transformer architectures with billions of parameters, producing remarkably human-like text.
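
The predict-append-repeat loop can be shown with a deliberately tiny stand-in for the trained model: a table that counts which word follows which in a toy corpus. The corpus and the helper names `predict_next` and `generate` are assumptions for this sketch; a real system would call a trained RNN or Transformer at the prediction step.

```python
from collections import Counter, defaultdict

corpus = "the sky is blue . the sky is clear . the sea is blue .".split()

# Stand-in "model": for each word, count the words that follow it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of word, or None if the word is unseen."""
    options = follows.get(word)
    return options.most_common(1)[0][0] if options else None

def generate(seed, n_words):
    """Repeatedly predict the next word and append it to the text."""
    words = seed.lower().split()
    for _ in range(n_words):
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("The Sky", 2))   # the sky is blue
```

This toy predictor only looks at the last word, whereas the sequential models in this lecture condition on the whole preceding context.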

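Python · NLP classifier: Embedding + RNN
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# Placeholder inputs so the snippet runs on its own (random values, illustration only).
# In practice padded_data holds the padded integer sequences and labels the 0/1 spam labels.
padded_data = np.random.randint(1, 5000, size=(100, 10))
labels = np.random.randint(0, 2, size=(100,))

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=16))  # word id -> 16-dim dense vector
model.add(SimpleRNN(32))                             # read the sequence
model.add(Dense(1, activation='sigmoid'))            # spam vs not-spam

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(padded_data, labels, epochs=5, validation_split=0.2)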

Practice Questions

NLP preprocessing details are heavily tested — work through every one.

MCQ · Q1 · Basics

Why must text be converted into numbers before an ML model can use it?

  • A Numbers take less storage
  • B Models are mathematical — they can only perform operations on numbers
  • C Text cannot be stored on a computer
  • D It makes the text shorter
Answer: B

ML models are mathematical equations — they multiply, add and subtract numbers. Raw text must be converted into numerical vectors that preserve meaning.

MCQ · Q2 · Tokenization

Tokenization is the process of:

  • A Removing all punctuation
  • B Breaking text into smaller units (tokens) like words or sub-words
  • C Translating text to another language
  • D Encrypting the text
Answer: B

Tokenization splits continuous text into discrete tokens (words, sub-words or characters) so the model can process each unit.

MCQ · Q3 · Stemming/Lemmatization

Which technique uses a dictionary look-up to always produce a real root word?

  • A Stemming
  • B Lemmatization
  • C Padding
  • D Tokenization
Answer: B

Lemmatization uses a dictionary, so "studies" → "study". Stemming crudely chops endings and can produce non-words ("studies" → "studi").

MCQ · Q4 · Embeddings

The key advantage of word embeddings over one-hot encoding is that embeddings:

  • A Are always exactly 50,000 dimensions long
  • B Are dense and capture semantic meaning (similar words → similar vectors)
  • C Remove the need for any training
  • D Only work for the English language
Answer: B

One-hot vectors are huge, sparse and "dumb" (treat happy/joy as unrelated). Embeddings are compact, dense, and place similar-meaning words close together.

MCQ · Q5 · Stop words

Why can blindly removing stop words be dangerous?

  • A It makes the text longer
  • B Removing words like "not" can flip the sentence's meaning
  • C It deletes all the nouns
  • D It encrypts the text
Answer: B

"The movie was NOT good" → "movie good" reverses the sentiment. Sophisticated models (RNNs, Transformers) usually keep stop words for this reason.

MCQ · Q6 · Padding

Why is padding needed before feeding sequences into a neural network?

  • A To translate the text
  • B Networks require fixed-length inputs, but sentences vary in length
  • C To remove stop words
  • D To increase the vocabulary size
Answer: B

Neural networks need every input to be the same length. Padding adds zeros to shorter sequences so they all reach a common maxlen.

MCQ · Q7 · Sequential modelling

Next-word prediction relies on the fact that:

  • A Word order does not matter
  • B Language is sequential — each word depends on the words before it
  • C All sentences have the same length
  • D Punctuation determines meaning
Answer: B

Sequential models exploit temporal dependency — "Dog bites man" ≠ "Man bites dog" — to predict the most likely continuation.

Numerical · Q8 · Padding

Pad the sequence [12, 7, 33] to length 6 using post-padding. What is the result?

Answer: [12, 7, 33, 0, 0, 0]

Post-padding adds zeros at the end until the sequence reaches the target length of 6. Pre-padding would give [0, 0, 0, 12, 7, 33].

Coding · Q9 · Text cleaning

Write Python code that lowercases a string and removes all punctuation from it.

Solution
Python
import string

text = "Ugh... The DELIVERIES were Delayed!!"

# Lowercase
text = text.lower()
# Remove punctuation: maketrans builds a translation table
text = text.translate(str.maketrans('', '', string.punctuation))

print(text)
Output
ugh the deliveries were delayed

str.maketrans('', '', string.punctuation) creates a table that maps every punctuation character to nothing.

Coding · Q10 · Tokenize & pad

Using Keras, tokenize a list of sentences and pad them to a maximum length of 5 with post-padding.

Solution
Python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love NLP", "Win free money now"]

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(seqs, maxlen=5, padding='post')
print(padded)
Output
[[2 3 4 0 0]
 [5 6 7 8 0]]

Short Answer · Q11 · Concept

Give two reasons one-hot encoding is a poor way to represent words for large vocabularies.

Model answer

(1) Huge and sparse — with a 50,000-word vocabulary, every word vector is 50,000 numbers long and almost entirely zeros, wasting memory and compute. (2) No meaning — one-hot treats every word as equally distinct, so "happy" and "joy" appear completely unrelated. Word embeddings fix both: they are dense (50–300 dims) and place similar words close together.

Short Answer · Q12 · Pipeline

List, in order, the main steps to take a raw sentence and prepare it for an RNN.

Model answer

(1) Clean the text — lowercase, remove punctuation. (2) Tokenize into words. (3) Optionally remove stop words / stem or lemmatize. (4) Integer-encode each token (build a vocabulary). (5) Pad all sequences to a fixed length. (6) Pass through an Embedding layer to get dense vectors. (7) Feed into the RNN.

🎯 Lecture 9: must-remember
  • Pipeline: clean → tokenize → stop words → stem/lemmatize → vectorize → pad.
  • Stemming = crude chop (fast, can give non-words); Lemmatization = dictionary (accurate).
  • One-hot = huge & sparse; embeddings = dense & semantic.
  • Padding gives fixed-length inputs (pre/post).
  • Sequential models exploit word order for next-word prediction.