Natural Language Processing
Bridging human language and machine understanding. Learn how raw, messy text is cleaned, tokenized, turned into numbers, and used to model sequences — the foundation of every LLM.
9.1 Why NLP needs numbers
Computers only understand numbers. Human language — words, sentences, paragraphs — is meaningless to a machine until it is converted into numerical vectors. The whole NLP pipeline is about that conversion: raw text → clean text → tokens → numbers → vectors that preserve meaning.
Applications: virtual assistants (Siri, Alexa), machine translation, sentiment analysis, chatbots, spam filtering, autocomplete.
9.2 Text Preprocessing
To a computer, "Apple", "apple" and "apple!" are three different words. Preprocessing normalises text so the model is not confused by trivial differences.
Lowercasing
Converts all text to lowercase ("HELLO" → "hello"). Reduces vocabulary size and prevents the same word being treated as different tokens. Caution: case can matter sometimes ("Apple" the company vs "apple" the fruit).
Punctuation removal
Removes symbols like ! , . ? #. Reduces noise and lets the model focus on core words. Caution: punctuation occasionally carries meaning.
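Stop word removal
Removes very common words such as "the", "is" and "a" that carry little meaning on their own. Caution: negations are often on stop lists, so "The movie was NOT good" can collapse to "movie good" and reverse the sentiment.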
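A minimal sketch of stop word removal. The stop list here is a tiny hand-written one, assumed purely for illustration (real lists, such as the one shipped with NLTK, are much longer). It also shows how dropping a negation changes what the sentence appears to say.

```python
# Toy stop list: hypothetical and far shorter than a real one.
STOP_WORDS = {"the", "was", "is", "a", "an", "not"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [w for w in sentence.lower().split() if w not in stop_words]

print(remove_stop_words("The movie was NOT good"))         # ['movie', 'good']

# Keeping the negation out of the stop list preserves the sentiment.
safer = STOP_WORDS - {"not"}
print(remove_stop_words("The movie was NOT good", safer))  # ['movie', 'not', 'good']
```

Trimming negations from the stop list is one simple design choice; another is to skip stop word removal altogether when the model can use context.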
Stemming vs Lemmatization
We want "waiting", "waited", "waits" to map to one root so the model recognises them as related.
| | Stemming | Lemmatization |
|---|---|---|
| Method | Crude chopping of word endings | Dictionary look-up to the real root word |
| Speed | Fast | Slower |
| Accuracy | Lower — may produce non-words | Higher — always produces real words |
| Example "studies" | → "studi" (not a real word) | → "study" (correct) |
| Example "deliveries" | → "deliveri" | → "delivery" |
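The contrast in the table can be sketched with two toy functions. Both are assumptions made for illustration: `toy_stem` is a handful of suffix rules (not the Porter algorithm), and `LEMMAS` is a hypothetical mini-dictionary standing in for a real lexicon such as WordNet.

```python
def toy_stem(word):
    """Crude rule-based chopping; NOT the Porter algorithm, just an illustration."""
    if word.endswith("ies"):
        return word[:-2]                      # "studies" -> "studi"
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"studies": "study", "deliveries": "delivery",
          "waiting": "wait", "waited": "wait", "waits": "wait"}

def toy_lemmatize(word):
    """Dictionary look-up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "deliveries", "waiting", "waited", "waits"]:
    print(f"{w:12} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The stemmer is fast because it only inspects the word ending, which is also why it can emit non-words; the look-up returns real words but only for entries the dictionary contains.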
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-time downloads of the NLTK data used below
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
text = "Ugh... The deliveries were DELAYED!! I hate waiting"
# 1. Lowercase
text = text.lower()
# 2. Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
# 3. Remove stop words (build the set once instead of once per word)
stop_words = set(stopwords.words('english'))
tokens = [w for w in text.split() if w not in stop_words]
print("Cleaned tokens:", tokens)
# 4. Stem vs Lemmatize
print("studies ->", PorterStemmer().stem("studies"),
      "|", WordNetLemmatizer().lemmatize("studies"))
9.3 Tokenization
Two strategies
- Word tokenization (simple split): split on spaces. "I love AI" → ["I", "love", "AI"]. Simple but fails on rare/unknown words.
- Sub-word tokenization (BERT/GPT style): breaks rare words into known chunks. "microtransactional" → ["micro", "##tra", "##ns", "##act", "##ional"]. This handles any word, even made-up ones.
In Keras, Tokenizer.fit_on_texts() builds the vocabulary; texts_to_sequences() converts sentences to integer lists; word_index is the word→integer dictionary. An oov_token handles Out-Of-Vocabulary (unseen) words.
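To make the sub-word strategy concrete, here is a minimal sketch of greedy longest-match splitting in the WordPiece style, where non-initial pieces carry a "##" prefix. The vocabulary `VOCAB` and the `[UNK]` fallback are assumptions chosen for this example; this is not the tokenizer code of any particular library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:                    # try the longest remaining chunk first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:                     # no piece fits: give up on this word
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
VOCAB = {"micro", "##tra", "##ns", "##act", "##ional", "love"}
print(wordpiece_tokenize("microtransactional", VOCAB))
# ['micro', '##tra', '##ns', '##act', '##ional']
```

With this toy vocabulary an unknown string such as "xyz" falls back to `[UNK]`; real sub-word vocabularies usually also contain single characters, which is what lets them split almost any word.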
9.4 Word Embeddings
Tokens are still text, and models need numbers. There are several ways to turn words into numbers, each an improvement on the one before.
Naive approach — number the words
Apple=1, Banana=2, Carrot=3… Fails, because it invents a fake ordering (Carrot > Apple?) the same way label-encoding nominal data does.
One-Hot Encoding
Each word becomes a binary vector the size of the whole vocabulary — one position is 1, the rest 0. Problems: (1) huge & sparse — a 50,000-word vocabulary gives 50,000-length vectors that are almost all zeros; (2) "dumb" — it treats "happy" and "joy" as completely unrelated.
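A small sketch of both problems, using a hypothetical four-word vocabulary. The dot product between two different one-hot vectors is always 0, so "happy" looks exactly as unrelated to "joy" as it does to "apple".

```python
vocab = ["apple", "happy", "joy", "king"]   # hypothetical toy vocabulary

def one_hot(word, vocab):
    """Return a list with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

happy, joy, apple = (one_hot(w, vocab) for w in ("happy", "joy", "apple"))
print(happy)              # [0, 1, 0, 0]
print(dot(happy, joy))    # 0
print(dot(happy, apple))  # 0
print(dot(happy, happy))  # 1
```

Each vector is as long as the vocabulary, so the same code with 50,000 words would produce 50,000-element vectors containing a single 1.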
Bag of Words (BoW)
Represents a sentence by word counts over the vocabulary. "Good movie" and "Movie is good" get similar vectors because they share words. Still sparse and ignores word order and meaning.
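A minimal pure-Python sketch of Bag of Words, assuming simple lowercase whitespace tokenization. It shows that sentences sharing words get similar count vectors, and that word order is lost entirely.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (sorted vocabulary, one count vector per sentence)."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["Good movie", "Movie is good"])
print(vocab)    # ['good', 'is', 'movie']
print(vectors)  # [[1, 0, 1], [1, 1, 1]]

# Word order is ignored: these two sentences get identical vectors.
_, order = bag_of_words(["Dog bites man", "Man bites dog"])
print(order[0] == order[1])  # True
```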
Word Embeddings — the modern solution
Key advantages over one-hot:
- Dense & compact — 50–300 dimensions instead of 50,000.
- Semantic — "king" and "queen" sit close together in vector space; "king" and "apple" sit far apart.
- Learned — the vectors are learned automatically during training.
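Example (simplified to three dimensions): King → [0.99, 0.99, 0.05] (royal, male, not-food). Queen → [0.99, 0.05, 0.05] (royal, female, not-food). Apple → [0.01, 0.10, 0.95] (not-royal, neutral, food). The model automatically learned that King and Queen are related: their vectors agree on the royal and food dimensions and differ only on gender, while Apple sits far from both.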
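"Close together" can be measured with cosine similarity. The sketch below applies it to the three example vectors for King, Queen and Apple; cosine similarity is one common choice of measure, assumed here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king  = [0.99, 0.99, 0.05]   # royal, male, not-food
queen = [0.99, 0.05, 0.05]   # royal, female, not-food
apple = [0.01, 0.10, 0.95]   # not-royal, neutral, food

print(round(cosine(king, queen), 2))  # 0.74
print(round(cosine(king, apple), 2))  # 0.12
```

King scores far higher against Queen than against Apple, which is the sense in which the two royal words are neighbours in vector space.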
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["Win free money", "Win huge money"]
bow = CountVectorizer().fit_transform(corpus)
print("BoW matrix:\n", bow.toarray())
# Modern: pre-trained sentence embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vecs = model.encode(["King", "Queen", "Apple"])
print("Embedding length:", len(vecs[0])) # 384 dense numbers
Padding
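Neural networks need every input sequence to be the same length, so shorter sequences are filled with zeros. Keras pad_sequences controls this with:
- padding='pre' → zeros added at the beginning.
- padding='post' → zeros added at the end.
- maxlen → the final fixed length.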
Example: [90, 400, 22, 10, 5] padded to length 9 (post) → [90, 400, 22, 10, 5, 0, 0, 0, 0]. The zeros act like "empty desks" the model ignores.
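The example above can be reproduced with a few lines of plain Python. `pad` is a hypothetical helper written for this sketch, not the Keras function; it only adds zeros and leaves longer sequences unchanged.

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Pad one integer sequence with zeros up to maxlen (longer input is left as-is)."""
    seq = list(seq)
    fill = [value] * max(0, maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq

print(pad([90, 400, 22, 10, 5], 9, padding="post"))  # [90, 400, 22, 10, 5, 0, 0, 0, 0]
print(pad([90, 400, 22, 10, 5], 9, padding="pre"))   # [0, 0, 0, 0, 90, 400, 22, 10, 5]
```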
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = ["I love AI", "The deliveries were delayed", "Win free money"]
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(seqs, padding='post')
print(padded)
9.5 Sequential Modelling & Language Foundations
Next-word prediction
The core task behind text generation and autocomplete: given a sequence of words, predict the most likely next word. Given "The cat sat on the", a good model predicts "mat" or "floor".
The next-word-prediction workflow
- Tokenize & integer-encode the text.
- Generate N-gram sequences (progressively longer chunks).
- Pad the sequences to equal length.
- Split into features X (the context) and target y (the next word).
- Apply an Embedding layer.
- Process through an RNN.
- A Dense output layer with Softmax picks the next word.
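The data-preparation steps of this workflow (N-gram sequences, padding, and the X/y split) can be sketched in plain Python. The integer encoding in the example is hypothetical, and pre-padding is assumed so that the target word is always the last element of each padded sequence.

```python
def make_training_pairs(token_ids):
    """Turn one encoded sentence into padded (context, next-word) pairs."""
    # Progressively longer prefixes: [a, b], [a, b, c], ...
    ngrams = [token_ids[: i + 1] for i in range(1, len(token_ids))]
    maxlen = len(token_ids)
    # Pre-pad with zeros so the target word is always the last element.
    padded = [[0] * (maxlen - len(g)) + g for g in ngrams]
    X = [p[:-1] for p in padded]   # context
    y = [p[-1] for p in padded]    # next word
    return X, y

# Hypothetical encoding: the=1, cat=2, sat=3, down=4
X, y = make_training_pairs([1, 2, 3, 4])
for context, target in zip(X, y):
    print(context, "->", target)
# [0, 0, 1] -> 2
# [0, 1, 2] -> 3
# [1, 2, 3] -> 4
```

Each row of X would then pass through the Embedding layer and the RNN, with y as the class the Softmax layer should predict.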
Text generation
Once trained, the model generates text iteratively: start with a seed ("The Sky"), predict the next word ("is"), append it ("The Sky is"), and repeat. Modern LLMs like GPT (Lecture 10) do exactly this — but with Transformer architectures and billions of parameters, producing remarkably human-like text.
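The generation loop itself is independent of the model behind it. In the sketch below a toy bigram counter stands in for the trained network (an assumption made to keep the example self-contained); the predict, append, repeat cycle is the part that carries over.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which; a stand-in for a trained neural model."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def generate(seed, following, n_words=3):
    """Repeatedly predict the most frequent next word and append it."""
    words = seed.lower().split()
    for _ in range(n_words):
        candidates = following.get(words[-1])
        if not candidates:          # no known continuation: stop early
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Tiny hypothetical training corpus
model = train_bigrams(["the sky is blue", "the sky is clear", "the sky is blue today"])
print(generate("The sky", model, n_words=2))  # the sky is blue
```

A bigram counter only looks at the single previous word, whereas an RNN or Transformer conditions on the whole context.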
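from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
# Note: this example is a binary spam classifier built from the same Embedding + RNN blocks.
# For next-word prediction, the final layer would instead be
# Dense(vocab_size, activation='softmax') with a categorical cross-entropy loss.
# padded_data (padded integer sequences) and labels (0/1 per message) are assumed to exist already.
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=16))  # word -> dense vector
model.add(SimpleRNN(32))                             # read the sequence
model.add(Dense(1, activation='sigmoid'))            # spam vs not-spam
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(padded_data, labels, epochs=5, validation_split=0.2)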
NLP preprocessing details are heavily tested — work through every one.
Why must text be converted into numbers before an ML model can use it?
ML models are mathematical equations — they multiply, add and subtract numbers. Raw text must be converted into numerical vectors that preserve meaning.
Tokenization is the process of:
Tokenization splits continuous text into discrete tokens (words, sub-words or characters) so the model can process each unit.
Which technique uses a dictionary look-up to always produce a real root word?
Lemmatization uses a dictionary, so "studies" → "study". Stemming crudely chops endings and can produce non-words ("studies" → "studi").
The key advantage of word embeddings over one-hot encoding is that embeddings:
One-hot vectors are huge, sparse and "dumb" (treat happy/joy as unrelated). Embeddings are compact, dense, and place similar-meaning words close together.
Why can blindly removing stop words be dangerous?
"The movie was NOT good" → "movie good" reverses the sentiment. Sophisticated models (RNNs, Transformers) usually keep stop words for this reason.
Why is padding needed before feeding sequences into a neural network?
Neural networks need every input to be the same length. Padding adds zeros to shorter sequences so they all reach a common maxlen.
Next-word prediction relies on the fact that:
Sequential models exploit temporal dependency — "Dog bites man" ≠ "Man bites dog" — to predict the most likely continuation.
Pad the sequence [12, 7, 33] to length 6 using post-padding. What is the result?
Result: [12, 7, 33, 0, 0, 0]. Post-padding adds zeros at the end until the sequence reaches the target length of 6. Pre-padding would give [0, 0, 0, 12, 7, 33].
Write Python code that lowercases a string and removes all punctuation from it.
import string
text = "Ugh... The DELIVERIES were Delayed!!"
# Lowercase
text = text.lower()
# Remove punctuation: maketrans builds a translation table
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)
str.maketrans('', '', string.punctuation) creates a table that maps every punctuation character to nothing.
Using Keras, tokenize a list of sentences and pad them to a maximum length of 5 with post-padding.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = ["I love NLP", "Win free money now"]
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
seqs = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(seqs, maxlen=5, padding='post')
print(padded)
Give two reasons one-hot encoding is a poor way to represent words for large vocabularies.
(1) Huge and sparse — with a 50,000-word vocabulary, every word vector is 50,000 numbers long and almost entirely zeros, wasting memory and compute. (2) No meaning — one-hot treats every word as equally distinct, so "happy" and "joy" appear completely unrelated. Word embeddings fix both: they are dense (50–300 dims) and place similar words close together.
List, in order, the main steps to take a raw sentence and prepare it for an RNN.
(1) Clean the text — lowercase, remove punctuation. (2) Tokenize into words. (3) Optionally remove stop words / stem or lemmatize. (4) Integer-encode each token (build a vocabulary). (5) Pad all sequences to a fixed length. (6) Pass through an Embedding layer to get dense vectors. (7) Feed into the RNN.
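Steps 1, 2, 4 and 5 can be chained in a single framework-free function, sketched below. `prepare` is a hypothetical helper; it skips the optional step 3 and stops before the Embedding layer and RNN, which need a deep-learning library.

```python
import string

def prepare(sentences, maxlen):
    """Clean, tokenize, integer-encode, and post-pad a list of raw sentences."""
    table = str.maketrans("", "", string.punctuation)
    # Steps 1-2: lowercase, strip punctuation, split into word tokens.
    tokenized = [s.lower().translate(table).split() for s in sentences]
    # Step 4: build the vocabulary in order of first appearance (0 = padding).
    vocab = {}
    for tokens in tokenized:
        for w in tokens:
            vocab.setdefault(w, len(vocab) + 1)
    # Step 5: encode, cut to maxlen if needed, then post-pad with zeros.
    encoded = [[vocab[w] for w in tokens][:maxlen] for tokens in tokenized]
    padded = [seq + [0] * (maxlen - len(seq)) for seq in encoded]
    return vocab, padded

vocab, padded = prepare(["I love AI!", "Win free money, now."], maxlen=5)
print(vocab)   # {'i': 1, 'love': 2, 'ai': 3, 'win': 4, 'free': 5, 'money': 6, 'now': 7}
print(padded)  # [[1, 2, 3, 0, 0], [4, 5, 6, 7, 0]]
```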