⚡ LECTURE 8

Types of Neural Networks

From one neuron to deep networks. Learn the Multi-Layer Perceptron, the activation functions that give networks their power, and Recurrent Neural Networks that handle sequences.

Syllabus topics 27–29 ⏱ ~25 min read 12 practice questions

In this lecture

Multi-Layer Perceptron (MLP)
Activation Functions
Loss Functions & Optimizers
Recurrent Neural Networks (RNN)
Beyond MLPs — CNNs & Transformers
Practice Questions

8.1 Multi-Layer Perceptron (MLP)

Lecture 7 showed a single perceptron is a linear classifier — it cannot solve XOR. The fix is to stack neurons into layers.

Multi-Layer Perceptron (MLP) — a neural network with an input layer, one or more hidden layers, and an output layer. The hidden layers let it learn non-linear decision boundaries.

The three kinds of layers

Input layer — receives the raw features X.
Hidden layer(s) — perform feature extraction; "hidden" because you don't see their internal work.
Output layer — produces the final prediction.

🍕 The Pizza Factory analogy A single ingredient is not a pizza — you need steps. Input layer = raw ingredients (flour, water, tomato). Hidden layer = the chefs: Chef A mixes flour+water → dough; Chef B mixes tomato+spices → sauce. They create useful intermediate features. Output layer = the server combines dough+sauce+cheese → the final pizza. The chefs are "hidden" — you only see the menu and the meal.

🔑 Universal Approximation Theorem A network with just one hidden layer, a non-linear activation, and enough neurons can approximate any continuous function on a bounded input range to any desired accuracy. This is why MLPs are so powerful, and why adding a hidden layer solves XOR and other non-linear problems.

Forward propagation

Data flows input → output. At each neuron: (1) compute the weighted sum Z = WX + B, then (2) apply a non-linear activation A = f(Z).
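To make the two steps concrete, here is a minimal pure-Python sketch of forward propagation through a tiny 2-2-1 network. The weights are hand-picked for illustration (they are not learned and do not come from the lecture) so that the network happens to compute XOR; the helper names `relu`, `dense`, and `forward` are assumptions of this sketch.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One layer: for each neuron compute Z = W·X + B, then A = f(Z)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hand-picked (not trained) weights, chosen only for illustration.
W_HIDDEN = [[1.0, 1.0], [1.0, 1.0]]
B_HIDDEN = [0.0, -1.0]
W_OUT = [[1.0, -2.0]]
B_OUT = [0.0]


def forward(x):
    """Input -> hidden layer (ReLU) -> output layer (linear)."""
    hidden = dense(x, W_HIDDEN, B_HIDDEN, relu)
    output = dense(hidden, W_OUT, B_OUT, lambda z: z)
    return output[0]


xor_outputs = [forward([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```

With the ReLU removed from the hidden layer, no choice of weights would reproduce this output, because the whole network would be a single linear function.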

8.2 Activation Functions

🔑 Why activation functions exist An activation function adds non-linearity. Without it, stacking layers would just be a chain of linear operations — the whole network would collapse into one big linear function and could never learn curves. Activation functions are what make deep networks powerful.
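The collapse can be checked numerically. The sketch below stacks two purely linear layers and compares them with a single merged layer whose weights are W2·W1 and whose bias is W2·b1 + b2. The matrices and the helper names are made-up values for illustration only.

```python
def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


# Two linear layers with arbitrary illustrative weights (no activation).
W1, b1 = [[2.0, -1.0], [0.5, 3.0]], [1.0, -2.0]
W2, b2 = [[1.0, 4.0], [-3.0, 0.5]], [0.5, 0.0]


def two_linear_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)


# The same mapping expressed as ONE linear layer.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)


def one_linear_layer(x):
    return add(matvec(W_merged, x), b_merged)


x = [3.0, -2.0]
stacked = two_linear_layers(x)
merged = one_linear_layer(x)
print(stacked, merged)  # both print the same vector
```

Inserting a non-linear function between the two layers is what prevents this merge.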

Function	Range	Use / Notes
Sigmoid σ(z)=1/(1+e⁻ᶻ)	(0, 1)	Binary classification output (probability). Con: vanishing gradient.
Tanh hyperbolic tangent	(−1, 1)	Zero-centred (better than sigmoid). Con: still suffers vanishing gradients.
ReLU max(0, z)	[0, ∞)	Industry standard for hidden layers. Fast, fixes vanishing gradient for positive z. Con: "Dead ReLU" (neurons stuck at 0).
Softmax	(0, 1), sums to 1	Output layer for multi-class classification — turns raw scores (logits) into probabilities that sum to 1.

🧩 Softmax example Raw scores (logits) [2.0, 1.0, 0.1] → Softmax → approximately [0.66, 0.24, 0.10]. The outputs are now probabilities that sum to 1.0, so the network can say "66% class A, 24% class B, 10% class C".
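The example can be reproduced in a few lines. This is a minimal sketch of softmax using only the standard library; subtracting the largest logit before exponentiating is a common numerical-stability trick and does not change the result. Rounded to one decimal place the probabilities are 0.7, 0.2 and 0.1.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```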

⚠️ The Vanishing Gradient problem Sigmoid and tanh "squash" inputs into a small range, so their slopes are small: the sigmoid derivative is at most 0.25, and both derivatives approach 0 for large |z|. During backpropagation through many layers, these small factors get multiplied together and the gradient shrinks toward zero, so early layers barely learn. Solution: use ReLU in hidden layers, which keeps a gradient of 1 for positive inputs.
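A rough numerical illustration of the shrinkage, under a simplifying assumption: the sketch multiplies only the activation derivatives across 10 layers and ignores the weight matrices that also appear in real backpropagation. The function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, n_layers):
    """Product of n_layers identical local derivatives (weights ignored)."""
    return grad_fn(z) ** n_layers


sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 10)  # best case for sigmoid
relu_chain = chained_gradient(relu_grad, 1.0, 10)        # positive input
print(sigmoid_chain, relu_chain)  # roughly 9.5e-07 versus 1.0
```

Even in sigmoid's best case the factor after 10 layers is below one millionth, while the ReLU factor for positive inputs stays at 1.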

💡 Tip — the standard recipe Hidden layers → ReLU. Output layer for binary classification → Sigmoid. Output layer for multi-class → Softmax. Output layer for regression → no activation (linear). Memorise this — it appears in coding questions constantly.

8.3 Loss Functions & Optimizers

Loss functions — the scorecard

MSE (Mean Squared Error) — for regression (predicting numbers).
Cross-Entropy / Categorical Cross-Entropy — for classification; heavily penalises confident wrong predictions.
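Both losses are short formulas. The sketch below computes MSE for two regression predictions and the cross-entropy of a single classification example, to show how much more a confident wrong prediction costs than a confident right one. All numbers are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(true_class, probs):
    """Categorical cross-entropy for one example: -log(p of the true class)."""
    return -math.log(probs[true_class])


regression_loss = mse([3.0, 5.0], [2.0, 7.0])            # (1 + 4) / 2 = 2.5
confident_right = cross_entropy(0, [0.90, 0.05, 0.05])   # about 0.105
confident_wrong = cross_entropy(0, [0.01, 0.98, 0.01])   # about 4.605
print(regression_loss, confident_right, confident_wrong)
```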

Optimizers — how the weights are updated

SGD — basic stochastic gradient descent; can be slow and can get stuck in local minima.
Momentum — accumulates velocity in a consistent direction, dampening oscillations.
Adam — combines momentum with adaptive learning rates; the default choice for most problems.
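The three update rules can be compared on a toy problem. This is a sketch under stated assumptions: the loss is f(w) = w², the exact gradient is used at every step (so there is no mini-batch noise), and the hyperparameter values are common illustrative choices rather than values from the lecture.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w


def run_sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)                     # step against the gradient
    return w


def run_momentum(w, lr=0.1, beta=0.9, steps=100):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad(w)   # accumulate velocity
        w += velocity
    return w


def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # momentum-like average of gradients
        v = beta2 * v + (1 - beta2) * g * g   # average of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return w


start = 5.0
results = {
    "sgd": run_sgd(start),
    "momentum": run_momentum(start),
    "adam": run_adam(start),
}
print(results)
```

All three move w from 5.0 toward the minimum at 0; on a problem this simple they differ only in the path taken, so the advantages of momentum and adaptive rates show up mainly on harder loss surfaces.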

Python · building an MLP with Keras

from keras import models, layers
from keras.layers import Input

# MLP for classifying 28x28 images (flattened to 784) into 10 classes
model = models.Sequential([
    Input(shape=(784,)),
    layers.Dense(64, activation='relu'),     # hidden layer 1
    layers.Dense(32, activation='relu'),     # hidden layer 2
    layers.Dense(10, activation='softmax')   # output: 10 classes
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

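# Assumes x_train/x_test have shape (n, 784) and y_train/y_test are one-hot
# encoded with shape (n, 10); with integer labels, use
# loss='sparse_categorical_crossentropy' instead.
model.fit(x_train, y_train, epochs=10,
          validation_data=(x_test, y_test))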

Output

Epoch 1/10   accuracy: 0.9215 - val_accuracy: 0.9578
Epoch 5/10   accuracy: 0.9819 - val_accuracy: 0.9708
Epoch 10/10  accuracy: 0.9917 - val_accuracy: 0.9760

💡 Tip — MLP vs simpler models on MNIST On the MNIST handwritten-digit dataset, a Decision Tree reaches ~88% test accuracy, Logistic Regression ~92%, but an MLP reaches ~97%+. Hidden layers and non-linear activations let the MLP capture far richer patterns.

8.4 Recurrent Neural Networks (RNN)

⚠️ Why MLPs fail on sequences A standard MLP treats every input independently and has no notion of order. But language and time-series are sequential — "Dog bites man" ≠ "Man bites dog". We need a network with memory.

Recurrent Neural Network (RNN) — a network designed for sequential data. It processes inputs one step at a time and maintains a hidden state that carries information from previous steps — a form of memory.

How an RNN remembers

At each time step t, the RNN combines the current input with the previous hidden state:

h_t = tanh( W_h·h_{t−1} + W_x·x_t + b )

where h_{t−1} is the memory of everything seen so far and x_t is the new word/value.

This loop means the hidden state at the end is a (nested) function of all earlier inputs — so the network "remembers" context when predicting the next word.
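The recurrence is just a loop. The sketch below uses a scalar hidden state and scalar weights (illustrative values, not trained) to show that the final state depends on the order of the inputs, which an order-blind model could not capture.

```python
import math


def rnn_final_state(inputs, w_h=0.5, w_x=1.0, b=0.0):
    """Apply h_t = tanh(w_h * h_prev + w_x * x_t + b) over a sequence."""
    h = 0.0  # initial hidden state: nothing remembered yet
    for x_t in inputs:
        h = math.tanh(w_h * h + w_x * x_t + b)
    return h


# Same values, opposite order.
forward_order = rnn_final_state([1.0, 0.0, -1.0])
reverse_order = rnn_final_state([-1.0, 0.0, 1.0])
print(forward_order, reverse_order)  # different final states
```

In a real RNN the hidden state is a vector and W_h, W_x are learned matrices, but the loop has the same shape.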

What RNNs are used for

Next-word prediction & text generation
Sentiment classification of sentences
Machine translation
Time-series forecasting (stock prices, weather)

Python · an RNN for sentiment classification

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=100, output_dim=8))  # word -> vector
model.add(SimpleRNN(16))                           # the sequence reader
model.add(Dense(1, activation='sigmoid'))          # 0 = negative, 1 = positive

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
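# Assumes padded_data holds integer word indices (all < 100, matching input_dim)
# padded to equal length, and labels holds one 0/1 sentiment per sequence.
model.fit(padded_data, labels, epochs=50)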

preds = model.predict(test_pad)   # e.g. "movie was good"
print("Prediction:", preds)

Output

Prediction: [[0.92]]   # 0.92 -> Positive sentiment

⚠️ RNN limitation — long-term memory Basic ("vanilla") RNNs suffer from the vanishing gradient problem over long sequences, so they forget information from far back. The improved variants LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) add gates that help retain long-range context. Transformers (Lecture 10) later replaced RNNs for most NLP tasks.

8.5 Beyond MLPs — CNNs & Transformers

Standard dense (MLP) layers ignore structure. Specialised architectures handle structured data better:

Architecture	Designed for	Key idea
CNN (Convolutional NN)	Images / spatial data	Filters that capture spatial patterns (edges, shapes)
RNN / LSTM	Sequences / time-series	Hidden state carries memory across time steps
Transformer	Text / sequences (modern)	Self-attention — processes the whole sequence at once (Lecture 10)

💡 Tip — diagnosing training with loss curves Watch validation loss: if training loss keeps falling but validation loss starts rising, the network is overfitting → use early stopping, more data, dropout, or a simpler model.
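The early-stopping rule can be sketched in plain Python: stop once validation loss has failed to improve for a set number of epochs in a row. The function name, the `patience` default and the loss values below are all illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss                       # new best validation loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1   # validation loss did not improve
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: they fall, then start rising (overfitting).
val_losses = [0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)  # 5
```

In Keras the same idea is available as the `EarlyStopping` callback (with `monitor='val_loss'` and a `patience` argument), passed to `model.fit` through `callbacks=[...]`.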

Practice Questions

Activation functions and RNN concepts are exam favourites.

MCQ · Q1 · MLP

What allows a Multi-Layer Perceptron to learn non-linear patterns that a single perceptron cannot?

A A larger learning rate
B Hidden layers combined with non-linear activation functions
C More training epochs only
D Removing the bias terms

Answer: B

Hidden layers plus non-linear activations let the network bend decision boundaries. Without non-linear activations, stacked layers collapse into one linear function.

MCQ · Q2 · Activation

Which activation function is the industry standard for hidden layers?

A Sigmoid
B Softmax
C ReLU
D Linear

Answer: C

ReLU = max(0, z) is fast and avoids the vanishing-gradient problem for positive inputs, making it the default for hidden layers.

MCQ · Q3 · Activation

For the output layer of a 10-class classification problem, you should use:

A ReLU
B Sigmoid
C Softmax
D Tanh

Answer: C

Softmax converts raw scores into probabilities that sum to 1 across all classes — ideal for multi-class output. Sigmoid is for binary output.

MCQ · Q4 · Non-linearity

If a deep network used no activation functions (purely linear), what would happen?

A It would train faster and be more accurate
B All the layers would collapse into a single linear function — unable to learn curves
C It would become a Decision Tree
D Nothing would change

Answer: B

A composition of linear functions is still linear. Non-linear activations are what let multiple layers represent complex, curved patterns.

MCQ · Q5 · RNN

What makes an RNN suitable for sequential data like text?

A It has more layers than an MLP
B It keeps a hidden state that carries information from previous time steps
C It never uses activation functions
D It processes all words in random order

Answer: B

The hidden state acts as memory — it passes context from earlier steps forward, so word order and history influence the prediction.

MCQ · Q6 · Optimizer

Which optimizer is the common default choice, combining momentum with adaptive learning rates?

A SGD
B Adam
C ReLU
D Softmax

Answer: B

Adam (Adaptive Moment Estimation) blends momentum with per-parameter adaptive learning rates and is the practical default. ReLU/Softmax are activations, not optimizers.

MCQ · Q7 · Vanishing gradient

The vanishing gradient problem in deep networks using sigmoid is best mitigated by:

A Removing all hidden layers
B Switching hidden-layer activations to ReLU
C Setting the learning rate to 0
D Using softmax in every layer

Answer: B

ReLU keeps a gradient of 1 for positive inputs, so gradients do not shrink toward zero through many layers — unlike sigmoid/tanh which squash values.

Numerical · Q8 · ReLU

Apply the ReLU activation to each value: −3, 0, 2.5, −0.1, 7.

Answer: 0, 0, 2.5, 0, 7

ReLU(z) = max(0, z). Negative values and 0 become 0; positive values pass through unchanged: −3→0, 0→0, 2.5→2.5, −0.1→0, 7→7.
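The answer can be verified with a one-line definition of ReLU:

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0, z)


values = [-3, 0, 2.5, -0.1, 7]
activated = [relu(v) for v in values]
print(activated)  # [0, 0, 2.5, 0, 7]
```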

Coding · Q9 · Build an MLP

Using Keras, build an MLP for binary classification with input dimension 20, two hidden layers (16 and 8 neurons, ReLU), and a suitable output layer. Compile it.

Solution

Python

from keras import models, layers
from keras.layers import Input

model = models.Sequential([
    Input(shape=(20,)),
    layers.Dense(16, activation='relu'),    # hidden 1
    layers.Dense(8,  activation='relu'),    # hidden 2
    layers.Dense(1,  activation='sigmoid')  # binary output
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Output

Model: "sequential" - Total params: 481 (trainable)

Binary output → 1 neuron + sigmoid + binary_crossentropy loss.
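The total of 481 parameters can be checked by hand: each Dense layer has (inputs × neurons) weights plus one bias per neuron. A small sketch of that arithmetic, with an illustrative helper name:

```python
def dense_params(n_inputs, n_neurons):
    """Weights (n_inputs * n_neurons) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


layer_sizes = [20, 16, 8, 1]  # input, hidden 1, hidden 2, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [336, 136, 9] 481
```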

Coding · Q10 · Build an RNN

Build a Keras model with an Embedding layer, a SimpleRNN layer, and a sigmoid output, for binary sentiment classification.

Solution

Python

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=16))  # words -> vectors
model.add(SimpleRNN(32))                             # reads the sequence
model.add(Dense(1, activation='sigmoid'))            # 0/1 sentiment

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])

Embedding turns word indices into dense vectors; SimpleRNN processes them in order; the sigmoid output gives a sentiment probability.

Short Answer · Q11 · Concept

Why is the choice of output-layer activation different for regression, binary classification and multi-class classification?

Model answer

The output activation must match the kind of answer needed. Regression needs any real number, so the output is linear (no activation). Binary classification needs a single probability in (0,1), so sigmoid. Multi-class needs probabilities across many classes that sum to 1, so softmax.

Short Answer · Q12 · RNN

Why might a basic RNN struggle to remember the start of a very long sentence, and what architectures fix this?

Model answer

Over long sequences, gradients are multiplied many times and shrink toward zero (vanishing gradient), so the RNN effectively "forgets" early information. LSTM and GRU add gating mechanisms to retain long-range context, and Transformers (Lecture 10) use self-attention to access any position directly.

🎯 Lecture 8 — must-remember MLP = input + hidden layer(s) + output; hidden layers with non-linear activations let it learn non-linear decision boundaries (Universal Approximation Theorem). Activations: ReLU (hidden), Sigmoid (binary out), Softmax (multi-class out), linear (regression out). Loss: MSE (regression), Cross-Entropy (classification). Optimizer default = Adam. RNN keeps a hidden state for sequences; LSTM/GRU add gates to retain long-range memory.
