Types of Neural Networks
From one neuron to deep networks. Learn the Multi-Layer Perceptron, the activation functions that give networks their power, and Recurrent Neural Networks that handle sequences.
8.1 Multi-Layer Perceptron (MLP)
Lecture 7 showed a single perceptron is a linear classifier — it cannot solve XOR. The fix is to stack neurons into layers.
The three kinds of layers
- Input layer — receives the raw features X.
- Hidden layer(s) — transform the inputs into intermediate features; "hidden" because their values are neither given as inputs nor observed as outputs.
- Output layer — produces the final prediction.
Forward propagation
Data flows input → output. At each neuron: (1) compute the weighted sum Z = WX + B, then (2) apply a non-linear activation A = f(Z).
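The two steps above can be sketched in NumPy. This is a minimal illustration, not a trained model: the weights `W1`, `b1`, `W2`, `b2` are hand-picked assumptions chosen so that a network with one hidden layer of two ReLU neurons reproduces XOR, the function a single perceptron cannot represent.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked (not learned) weights for a 2-2-1 network that computes XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # hidden weights, shape (2 hidden, 2 inputs)
b1 = np.array([0.0, -1.0])    # hidden biases
W2 = np.array([[1.0, -2.0]])  # output weights, shape (1 output, 2 hidden)
b2 = np.array([0.0])

def forward(x):
    z1 = W1 @ x + b1      # step 1: weighted sum in the hidden layer
    a1 = relu(z1)         # step 2: non-linear activation
    z2 = W2 @ a1 + b2     # output layer, left linear in this sketch
    return z2[0]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [forward(np.array(x, dtype=float)) for x in inputs]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

Removing the `relu` call makes the whole network linear again, and no choice of weights would then produce the XOR pattern.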
8.2 Activation Functions
| Function | Range | Use / Notes |
|---|---|---|
| Sigmoid σ(z)=1/(1+e⁻ᶻ) | (0, 1) | Binary classification output (probability). Con: vanishing gradient. |
| Tanh tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ) | (−1, 1) | Zero-centred, which usually makes optimisation easier than with sigmoid. Con: still suffers vanishing gradients. |
| ReLU max(0, z) | [0, ∞) | Industry standard for hidden layers. Fast; the gradient is 1 for positive z, which mitigates vanishing gradients. Con: "Dead ReLU" (neurons stuck at 0). |
| Softmax | (0, 1), sums to 1 | Output layer for multi-class classification — turns raw scores (logits) into probabilities that sum to 1. |
[2.0, 1.0, 0.1] → Softmax → ≈ [0.66, 0.24, 0.10]. The outputs are now probabilities that sum to 1.0, so the network can say "about 66% class A, 24% class B, 10% class C".
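The softmax calculation for these logits can be checked with a few lines of NumPy. The function below is a minimal sketch; subtracting the maximum logit is a common numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(probs, 3))  # approximately [0.659 0.242 0.099]
```

The largest logit always receives the largest probability, and the gaps between logits (not their absolute values) determine how peaked the distribution is.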
8.3 Loss Functions & Optimizers
Loss functions — the scorecard
- MSE (Mean Squared Error) — for regression (predicting numbers).
- Cross-Entropy / Categorical Cross-Entropy — for classification; heavily penalises confident wrong predictions.
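Both losses are short formulas, sketched below in NumPy with made-up targets and predictions. The example values are assumptions chosen to show how cross-entropy punishes a confident wrong prediction far more than it rewards a confident right one.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    # eps guards against log(0)
    return -np.sum(y_true_onehot * np.log(y_prob + eps))

# Regression: two targets, each prediction off by 0.5
reg_loss = mse(np.array([3.0, 5.0]), np.array([2.5, 5.5]))

# Classification: the true class is class 0
target = np.array([1.0, 0.0, 0.0])
confident_right = np.array([0.90, 0.05, 0.05])
confident_wrong = np.array([0.01, 0.98, 0.01])
loss_right = cross_entropy(target, confident_right)
loss_wrong = cross_entropy(target, confident_wrong)

print(reg_loss, loss_right, loss_wrong)
```

Here the confident wrong prediction costs about 4.6 (that is, -ln 0.01) while the confident right one costs about 0.1 (-ln 0.9).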
Optimizers — how the weights are updated
- SGD — basic stochastic gradient descent; can be slow and can get stuck in local minima.
- Momentum — accumulates velocity in a consistent direction, dampening oscillations.
- Adam — combines momentum with adaptive learning rates; the default choice for most problems.
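The three update rules can be compared on a toy problem. The sketch below minimises f(w) = w² (an assumed loss chosen only because its gradient, 2w, is easy to follow); the hyperparameter values are common textbook defaults, not tuned settings, and the function names are hypothetical.

```python
import numpy as np

def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w**2

def run_sgd(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(w, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)   # velocity accumulates past gradients
        w -= lr * v
    return w

def run_adam(w, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first moment (momentum term)
        v = b2 * v + (1 - b2) * g * g      # second moment (gradient scale)
        m_hat = m / (1 - b1 ** t)          # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

start = 5.0
results = {name: fn(start) for name, fn in
           [("sgd", run_sgd), ("momentum", run_momentum), ("adam", run_adam)]}
print(results)
```

All three move `w` from 5.0 toward the minimum at 0. On such a simple, well-scaled problem plain SGD is already effective; the benefits of momentum and adaptive learning rates show up on losses with very different curvature in different directions.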
from keras import models, layers
from keras.layers import Input
# MLP for classifying 28x28 images (flattened to 784) into 10 classes
model = models.Sequential([
    Input(shape=(784,)),
    layers.Dense(64, activation='relu'),    # hidden layer 1
    layers.Dense(32, activation='relu'),    # hidden layer 2
    layers.Dense(10, activation='softmax')  # output: 10 classes
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # expects one-hot labels
              metrics=['accuracy'])
# x_train/x_test and y_train/y_test are assumed to be prepared beforehand
model.fit(x_train, y_train, epochs=10,
          validation_data=(x_test, y_test))
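The model above expects each image as a flat vector of 784 values and, because it uses `categorical_crossentropy`, each label as a one-hot row of length 10. A minimal NumPy sketch of that preparation follows; the random `images` and the `labels` array are stand-in data, not a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data (an assumption): 5 random 28x28 "images" with integer labels 0-9.
images = rng.random((5, 28, 28))
labels = np.array([3, 0, 9, 3, 7])

x = images.reshape(len(images), -1)   # flatten each image to 784 features
y = np.eye(10)[labels]                # one-hot rows, one column per class

print(x.shape, y.shape)  # (5, 784) (5, 10)
```

If the labels are kept as integers instead, Keras also offers the `sparse_categorical_crossentropy` loss, which avoids the one-hot step.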
8.4 Recurrent Neural Networks (RNN)
How an RNN remembers
At each time step t, the RNN combines the current input x_t with the previous hidden state h_(t−1). In the standard formulation this is h_t = tanh(W_x x_t + W_h h_(t−1) + b).
This loop means the hidden state at the end is a (nested) function of all earlier inputs — so the network "remembers" context when predicting the next word.
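The recurrence can be sketched in NumPy as follows. This assumes the standard simple-RNN update with a tanh activation; the names `W_x`, `W_h`, `b` and the sizes are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))   # input weights
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # recurrent weights
b = np.zeros(hidden_size)

def rnn_final_state(sequence):
    h = np.zeros(hidden_size)              # initial hidden state
    for x_t in sequence:                   # one update per time step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

seq = rng.normal(size=(5, input_size))     # 5 time steps of 3 features each
h_forward = rnn_final_state(seq)
h_reversed = rnn_final_state(seq[::-1])    # same inputs, opposite order
print(h_forward)
print(h_reversed)
```

The same weights are reused at every step, and feeding the same inputs in reverse order gives a different final state, which is what makes the model sensitive to word order.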
What RNNs are used for
- Next-word prediction & text generation
- Sentiment classification of sentences
- Machine translation
- Time-series forecasting (stock prices, weather)
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
model = Sequential()
model.add(Embedding(input_dim=100, output_dim=8)) # word -> vector
model.add(SimpleRNN(16)) # the sequence reader
model.add(Dense(1, activation='sigmoid')) # 0 = negative, 1 = positive
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
model.fit(padded_data, labels, epochs=50)
preds = model.predict(test_pad) # e.g. "movie was good"
print("Prediction:", preds)
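The example above assumes `padded_data` already exists as a rectangular array of word indices. One way to produce such an array is sketched below in plain Python and NumPy; the `vocab` dictionary and the `encode_and_pad` helper are hypothetical, and left-padding with index 0 is an assumed convention.

```python
import numpy as np

sentences = ["movie was good", "movie was bad and long", "good"]

# Build a toy vocabulary; index 0 is reserved for padding.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab) + 1)

def encode_and_pad(texts, max_len):
    rows = []
    for text in texts:
        ids = [vocab[w] for w in text.split()][:max_len]   # truncate long inputs
        rows.append([0] * (max_len - len(ids)) + ids)       # left-pad with zeros
    return np.array(rows)

padded = encode_and_pad(sentences, max_len=5)
print(padded)
# [[0 0 1 2 3]
#  [1 2 4 5 6]
#  [0 0 0 0 3]]
```

The `input_dim` of the Embedding layer must be larger than the highest index used, so this toy vocabulary would need `input_dim` of at least 7.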
8.5 Beyond MLPs — CNNs & Transformers
Standard dense (MLP) layers ignore structure. Specialised architectures handle structured data better:
| Architecture | Designed for | Key idea |
|---|---|---|
| CNN (Convolutional NN) | Images / spatial data | Filters that capture spatial patterns (edges, shapes) |
| RNN / LSTM | Sequences / time-series | Hidden state carries memory across time steps |
| Transformer | Text / sequences (modern) | Self-attention — processes the whole sequence at once (Lecture 10) |
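To make the CNN row concrete, the sketch below slides a single hand-written filter over a tiny image. It is only an illustration of the "filter captures a spatial pattern" idea: the `conv2d_valid` helper is hypothetical, and real CNN layers learn their filter values instead of using fixed ones.

```python
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product of the filter with the image patch under it
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1]] * 4, dtype=float)  # dark left half, bright right half
edge_filter = np.array([[-1.0, 1.0]])              # responds to left-to-right increases
response = conv2d_valid(image, edge_filter)
print(response)
# [[0. 1. 0.]
#  [0. 1. 0.]
#  [0. 1. 0.]
#  [0. 1. 0.]]
```

The filter responds only at the column where dark meets bright, so the same small set of weights detects the edge wherever it appears in the image.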
Activation functions and RNN concepts are exam favourites.
What allows a Multi-Layer Perceptron to learn non-linear patterns that a single perceptron cannot?
Hidden layers plus non-linear activations let the network bend decision boundaries. Without non-linear activations, stacked layers collapse into one linear function.
Which activation function is the industry standard for hidden layers?
ReLU = max(0, z) is fast and avoids the vanishing-gradient problem for positive inputs, making it the default for hidden layers.
For the output layer of a 10-class classification problem, you should use:
Softmax converts raw scores into probabilities that sum to 1 across all classes — ideal for multi-class output. Sigmoid is for binary output.
If a deep network used no activation functions (purely linear), what would happen?
A composition of linear functions is still linear. Non-linear activations are what let multiple layers represent complex, curved patterns.
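This collapse can be verified numerically. The sketch below uses random weights (arbitrary assumptions, not trained values) and shows that two stacked linear layers give the same output as one layer with weight W2·W1 and bias W2·b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2   # two stacked layers, no activation

W_eq = W2 @ W1                          # equivalent single weight matrix
b_eq = W2 @ b1 + b2                     # equivalent single bias
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layers, one_layer))  # True
```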
What makes an RNN suitable for sequential data like text?
The hidden state acts as memory — it passes context from earlier steps forward, so word order and history influence the prediction.
Which optimizer is the common default choice, combining momentum with adaptive learning rates?
Adam (Adaptive Moment Estimation) blends momentum with per-parameter adaptive learning rates and is the practical default. ReLU/Softmax are activations, not optimizers.
The vanishing gradient problem in deep networks using sigmoid is best mitigated by:
ReLU keeps a gradient of 1 for positive inputs, so gradients do not shrink toward zero through many layers — unlike sigmoid/tanh which squash values.
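A rough numerical illustration of the difference: the sigmoid derivative never exceeds 0.25, so even in its best case the product of ten such factors is tiny. The sketch below is a simplification that ignores the weight matrices and looks only at the activation derivatives along one path through a 10-layer network.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 10
z = 0.0                                    # best case for sigmoid
sigmoid_chain = sigmoid_grad(z) ** depth   # product of 10 local gradients
relu_chain = relu_grad(1.0) ** depth       # any positive pre-activation
print(sigmoid_chain, relu_chain)
```

The sigmoid product is 0.25¹⁰, roughly 9.5e-07, while the ReLU product stays at 1.0 as long as every pre-activation on the path is positive.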
Apply the ReLU activation to each value: −3, 0, 2.5, −0.1, 7.
ReLU(z) = max(0, z). Negative values and 0 become 0; positive values pass through unchanged: −3→0, 0→0, 2.5→2.5, −0.1→0, 7→7.
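The same calculation in NumPy, as a quick check:

```python
import numpy as np

values = np.array([-3, 0, 2.5, -0.1, 7])
activated = np.maximum(0, values)   # element-wise max(0, z)
print(activated.tolist())  # [0.0, 0.0, 2.5, 0.0, 7.0]
```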
Using Keras, build an MLP for binary classification with input dimension 20, two hidden layers (16 and 8 neurons, ReLU), and a suitable output layer. Compile it.
from keras import models, layers
from keras.layers import Input
model = models.Sequential([
    Input(shape=(20,)),
    layers.Dense(16, activation='relu'),   # hidden 1
    layers.Dense(8, activation='relu'),    # hidden 2
    layers.Dense(1, activation='sigmoid')  # binary output
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Binary output → 1 neuron + sigmoid + binary_crossentropy loss.
Build a Keras model with an Embedding layer, a SimpleRNN layer, and a sigmoid output, for binary sentiment classification.
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=16)) # words -> vectors
model.add(SimpleRNN(32)) # reads the sequence
model.add(Dense(1, activation='sigmoid')) # 0/1 sentiment
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
Embedding turns word indices into dense vectors; SimpleRNN processes them in order; the sigmoid output gives a sentiment probability.
Why is the choice of output-layer activation different for regression, binary classification and multi-class classification?
The output activation must match the kind of answer needed. Regression needs any real number, so the output is linear (no activation). Binary classification needs a single probability in (0,1), so sigmoid. Multi-class needs probabilities across many classes that sum to 1, so softmax.
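The pairing of task, output layer and loss can be summarised as a small lookup. The `output_layer_spec` helper below is hypothetical; the strings it returns are the activation and loss names Keras accepts, and one output unit for regression assumes a single predicted value.

```python
def output_layer_spec(task, n_classes=None):
    """Return (units, activation, loss) for a task; a hypothetical helper."""
    if task == "regression":
        return 1, "linear", "mse"
    if task == "binary":
        return 1, "sigmoid", "binary_crossentropy"
    if task == "multiclass":
        return n_classes, "softmax", "categorical_crossentropy"
    raise ValueError(f"unknown task: {task}")

for task in ("regression", "binary", "multiclass"):
    print(task, output_layer_spec(task, n_classes=10))
```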
Why might a basic RNN struggle to remember the start of a very long sentence, and what architectures fix this?
Over long sequences, gradients are multiplied many times and shrink toward zero (vanishing gradient), so the RNN effectively "forgets" early information. LSTM and GRU add gating mechanisms to retain long-range context, and Transformers (Lecture 10) use self-attention to access any position directly.
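The shrinking gradient can be seen in a one-unit RNN. The sketch below assumes scalar weights (`w_h = 0.5`, `w_x = 1.0`) and a constant input of 0.1, all illustrative values, and tracks how much the final hidden state depends on the initial one by multiplying the local derivatives step by step.

```python
import numpy as np

w_h = 0.5          # scalar recurrent weight (assumed value)
w_x = 1.0          # scalar input weight (assumed value)
steps = 50

h = 0.0
grad_wrt_h0 = 1.0  # d h_t / d h_0, built up with the chain rule
for t in range(steps):
    h = np.tanh(w_x * 0.1 + w_h * h)       # constant input of 0.1 at every step
    grad_wrt_h0 *= w_h * (1.0 - h ** 2)    # derivative of this step w.r.t. the previous h
print(grad_wrt_h0)
```

Each step multiplies the gradient by a factor of at most 0.5, so after 50 steps it is below 0.5⁵⁰ (about 9e-16): the earliest inputs have almost no influence on the weight updates.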