⚡ LECTURE 7

Introduction to Neural Networks

Build your first "artificial brain cell". Learn how a perceptron computes a decision, why a single layer is limited, and how networks learn through gradient descent and backpropagation.

Syllabus topics 24–26 · ⏱ ~26 min read · 12 practice questions

In this lecture

The Artificial Neuron
The Single-Cell Perceptron
The Single-Layer Perceptron & its limits
How a perceptron learns — Gradient Descent
Backpropagation
Practice Questions

7.1 The Artificial Neuron

Neural networks are the engine of Deep Learning — ML with networks of many layers. They are loosely inspired by the brain.

Biological neuron	What it does	Artificial equivalent
Dendrites	Receive signals from other neurons	Inputs (x)
Synapses	Control how strong each signal is	Weights (w)
Cell body / Soma	Adds up all the signals	Summation (z)
Axon	Sends out the final decision	Output (y)

Artificial Neuron — a mathematical unit that takes inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function to produce an output.

7.2 The Single-Cell Perceptron

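The perceptron (Rosenblatt, 1958) was one of the first neuron models that could learn its weights from data, building on the earlier McCulloch-Pitts neuron (1943). It is a tiny decision-maker that does just three steps: (A) multiply each input by its weight, (B) add the products together with the bias, and (C) apply an activation to decide.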

Step A & B: z = (w₁·x₁) + (w₂·x₂) + … + b
Step C: if z ≥ 0 → output 1, else → output 0
Multiply inputs by weights, add the bias, then apply an activation to decide.

Component	Role
Inputs (x)	The data features
Weights (w)	How important each input is — learned during training
Bias (b)	A threshold/offset; lets the decision boundary shift so it need not pass through the origin
Weighted sum (z)	z = Σ(w·x) + b — the linear part
Activation function	Decides if the neuron "fires"; adds non-linearity

🍕 Worked example — "Should I order pizza?" Inputs: hunger x₁ = 8, money x₂ = 20. Weights: w₁ = 2 (hunger matters a lot), w₂ = 1, bias b = −15 (general laziness).
z = (2 × 8) + (1 × 20) + (−15) = 16 + 20 − 15 = 21. Since z ≥ 0 → output 1 → order pizza!

💡 Tip — why a bias term? Without a bias, the decision boundary is forced through the origin. The bias lets the activation shift left or right, so the model can fit data that does not pass through (0,0). Examiners love this question — "Why is a bias necessary?".
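A small sketch of the point above, assuming the same step rule (output 1 if z ≥ 0). The `perceptron` helper and the weight values are made up for illustration. With no bias, the input (0, 0) always gives z = 0, so the output is stuck at 1 whatever the weights are; a negative bias frees it.

```python
def perceptron(inputs, weights, bias=0.0):
    # Weighted sum followed by a step activation
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if z >= 0 else 0

origin = (0, 0)

# Without a bias, z = 0 at the origin for every choice of weights
for weights in [(1, 1), (-3, 2), (10, -10)]:
    print("no bias, weights", weights, "->", perceptron(origin, weights))

# A negative bias shifts the boundary away from the origin
print("bias = -0.5 ->", perceptron(origin, (1, 1), bias=-0.5))
```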

From hard threshold to probability — the sigmoid activation

A plain step ("if z ≥ 0 → 1") is harsh. Often we want a probability, so we pass z through the sigmoid (from Lecture 4):

σ(z) = 1 / (1 + e^(−z))

This "dimmer switch" squashes any z into (0, 1): big positive z → ≈1, big negative z → ≈0, z = 0 → 0.5.
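A minimal sketch of the sigmoid, using only the standard library. The function name `sigmoid` and the sample z values are chosen here for illustration. Note that this simple form can overflow for very large negative z, which real libraries guard against.

```python
import math

def sigmoid(z):
    """Squash any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Big negative z gives a value near 0, z = 0 gives 0.5, big positive z gives a value near 1
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3} -> sigma(z) = {sigmoid(z):.4f}")
```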

Python · a perceptron by hand

import numpy as np

def perceptron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias      # weighted sum
    return 1 if z >= 0 else 0               # step activation

x = [8, 20]          # hunger, money
w = [2, 1]           # weights
b = -15              # bias

print("Decision:", perceptron(x, w, b))     # 1 = order pizza

Output

Decision: 1

7.3 The Single-Layer Perceptron & its limits

A single-layer perceptron is one layer of neurons mapping inputs directly to outputs. It is fundamentally a linear classifier — it draws a single straight line (or hyperplane) to separate two classes.

⚠️ The XOR problem: the famous limitation. A single perceptron can model the AND and OR logic gates because their outputs are linearly separable (one straight line splits them). But it cannot model XOR (output 1 only when the inputs differ): no single straight line can separate XOR's classes. Minsky & Papert highlighted this limitation in 1969, and it is widely cited as a contributor to the first "AI Winter".

Linear separability — data is linearly separable if a single straight line/plane can perfectly split the classes. A single-layer perceptron only works on linearly separable data.
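To make linear separability concrete, here is a rough experiment rather than a proof. It tries every combination of w₁, w₂ and b on a small grid and reports the first one that reproduces each gate. The grid range and the helper name `find_weights` are assumptions made for this sketch. AND and OR should each find a solution, while XOR finds none. The real reason is geometric (no line exists at all), not just that the grid is too small.

```python
import itertools

def perceptron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if z >= 0 else 0

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR":  [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def find_weights(targets, grid):
    """Return the first (w1, w2, b) on the grid that reproduces targets, else None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        outputs = [perceptron(p, (w1, w2), b) for p in POINTS]
        if outputs == targets:
            return (w1, w2, b)
    return None

GRID = [i / 2 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0

for name, targets in GATES.items():
    print(name, "->", find_weights(targets, GRID))
```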

The fix? Stack neurons into hidden layers — a Multi-Layer Perceptron (Lecture 8) — which can learn non-linear boundaries and solve XOR.
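As a sketch of why a hidden layer helps, XOR can be written as "OR and not-both", which is two perceptrons feeding a third. The weights below are hand-picked for illustration, not learned, and the function names are hypothetical.

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_network(x1, x2):
    # Hidden layer: two perceptrons with hand-picked weights
    h_or = step(1 * x1 + 1 * x2 - 0.5)      # fires if at least one input is 1
    h_nand = step(-1 * x1 - 1 * x2 + 1.5)   # fires unless both inputs are 1
    # Output layer: AND of the two hidden units
    return step(1 * h_or + 1 * h_nand - 1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_network(x1, x2))
```

Each hidden unit draws its own straight line; the output unit combines the two half-planes into a region that no single line could describe.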

7.4 How a perceptron learns — Gradient Descent

⛰️ The foggy mountain. Imagine standing blindfolded on "Mount Error": your height is the model's total mistake (the loss), and you want to reach the valley (the lowest error). Strategy: feel the slope under your feet (the gradient), take a small step downhill, and repeat. That is Gradient Descent.

Loss function — measures "how wrong is the model?". For regression, the squared error e = (y − ŷ)². Training = minimising the loss.
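The squared error above is easy to write down in code. This is a minimal sketch; the function names and the sample numbers are illustrative assumptions. Averaging the squared error over several samples gives the mean squared error (MSE).

```python
def squared_error(y_true, y_pred):
    # Loss for a single sample: (y - y_hat) squared
    return (y_true - y_pred) ** 2

def mean_squared_error(y_true, y_pred):
    # Average the per-sample losses over the dataset
    errors = [squared_error(t, p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

print(squared_error(10.4, 12.31))
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```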

The update rule

New Weight = Old Weight − (Learning Rate × Gradient)
We move opposite to the slope to go downhill. The gradient is ∂Loss/∂weight.

Learning rate — the size of each step. Too large → overshoot the valley, loss bounces or diverges. Too small → painfully slow convergence. If learning rate = 0 → the model never learns at all (no update).
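A toy sketch of these learning-rate effects, assuming the simplest possible loss L(w) = w², whose gradient is 2w and whose minimum is at w = 0. The starting point, step count and learning rates are arbitrary choices for illustration.

```python
def gradient_descent(lr, w=5.0, steps=20):
    """Minimise the toy loss L(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - lr * gradient   # the update rule
    return w

for lr in (0.0, 0.001, 0.1, 1.1):
    print(f"lr = {lr:<5} -> w after 20 steps = {gradient_descent(lr):.4f}")
```

With lr = 0 the weight never moves, with a tiny lr it has barely left the start, with a moderate lr it lands close to 0, and with lr = 1.1 each step overshoots further than the last, so w grows instead of shrinking.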

🧩 Worked example — one gradient-descent step Predict sales. Inputs x₁=44.5, x₂=39.3; weights w₁=0.1, w₂=0.2, b=0; actual y=10.4.
Forward pass: prediction = 0.1·44.5 + 0.2·39.3 + 0 = 4.45 + 7.86 = 12.31.
Error: 10.4 − 12.31 = −1.91 (predicted too high).
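Update (learning rate 0.0001): gradient term −2(y − ŷ) = −2 × (−1.91) = 3.82. For w₁: slope = 3.82 × 44.5 ≈ 170.0, so w₁_new = 0.1 − 0.0001 × 170.0 ≈ 0.083. Likewise for w₂: slope = 3.82 × 39.3 ≈ 150.1, so w₂_new = 0.2 − 0.0001 × 150.1 ≈ 0.185; and for the bias: b_new = 0 − 0.0001 × 3.82 ≈ −0.0004.
Re-check: new prediction = 0.083·44.5 + 0.185·39.3 − 0.0004 ≈ 10.96, so the error shrank from −1.91 to −0.56. On this sample, the model is now closer to the target.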
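The same step can be checked in a few lines of plain Python. This sketch assumes the squared-error loss (y − ŷ)² and updates both weights and the bias in one step; the variable names are chosen for illustration.

```python
x = [44.5, 39.3]
w = [0.1, 0.2]
b = 0.0
y = 10.4
lr = 0.0001

def predict(x, w, b):
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# Forward pass
y_hat = predict(x, w, b)
error = y - y_hat
grad_common = -2 * error   # dLoss/dy_hat for Loss = (y - y_hat)**2

# Chain rule: dLoss/dw_i = grad_common * x_i, dLoss/db = grad_common
w_new = [wi - lr * grad_common * xi for wi, xi in zip(w, x)]
b_new = b - lr * grad_common

# Re-check with the updated parameters
y_hat_new = predict(x, w_new, b_new)
print("before:", round(y_hat, 2), "error:", round(error, 2))
print("new weights:", [round(wi, 3) for wi in w_new], "new bias:", round(b_new, 4))
print("after:", round(y_hat_new, 2), "error:", round(y - y_hat_new, 2))
```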

Variants of Gradient Descent

Batch GD — uses the whole dataset for each update (stable but slow).
Stochastic GD (SGD) — updates after each single sample (fast, noisy).
Mini-batch GD — updates after a small batch (the practical default).
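The three variants differ only in how many samples feed each update, so one loop with a `batch_size` parameter can sketch all of them. Everything here is an illustrative assumption: the synthetic data, the linear model, the learning rate and the `train` helper.

```python
import numpy as np

def train(X, y, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y ~ X @ w + b with mean-squared-error loss; returns (w, b, updates)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b, updates = np.zeros(n_features), 0.0, 0
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # shuffle each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w + b - y[idx]        # y_hat - y
            grad_w = 2 * X[idx].T @ residual / len(idx)
            grad_b = 2 * residual.mean()
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

# Synthetic data (made up for illustration): y = 3*x1 - 2*x2 + 1
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1

for name, bs in [("batch", 200), ("mini-batch", 32), ("stochastic", 1)]:
    w, b, updates = train(X, y, batch_size=bs)
    print(f"{name:<11} batch_size={bs:<3} updates={updates:<5} w={w.round(2)} b={b:.2f}")
```

For the same number of epochs, batch GD makes only one update per epoch, while the smaller batch sizes make many more (cheaper, noisier) updates.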

7.5 Backpropagation

Backpropagation — the algorithm that computes how much each weight contributed to the final error, by working backward from the output layer to the input layer using the chain rule of calculus.

The two passes of training

Forward pass — data flows input → output; the network makes a prediction and the loss is computed.
Backward pass (backpropagation) — the error is propagated backward; the chain rule computes the gradient ∂Loss/∂w for every weight. Gradient descent then updates the weights.

🔑 Backpropagation = "distributing the blame" We know the final error, but which weights caused it? Backpropagation answers that. Using the chain rule, it calculates each weight's share of responsibility (its gradient), layer by layer, from output back to input. Gradient descent then uses those gradients to nudge every weight toward less error.
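Here is a hand-written sketch of both passes on the smallest network that has a hidden layer: one input, one sigmoid hidden neuron, one linear output, and squared-error loss. The architecture, parameter values and function names are assumptions made for illustration. The final loop compares each backprop gradient with a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, y, w1, b1, w2, b2):
    """Forward pass: x -> hidden sigmoid neuron -> linear output -> loss."""
    z1 = w1 * x + b1
    h = sigmoid(z1)
    y_hat = w2 * h + b2
    loss = (y - y_hat) ** 2
    return z1, h, y_hat, loss

def backward(x, y, w1, b1, w2, b2):
    """Backward pass: chain rule from the loss back to each parameter."""
    z1, h, y_hat, loss = forward(x, y, w1, b1, w2, b2)
    d_yhat = -2 * (y - y_hat)     # dLoss/dy_hat
    d_w2 = d_yhat * h             # output layer weight
    d_b2 = d_yhat                 # output layer bias
    d_h = d_yhat * w2             # blame passed back to the hidden unit
    d_z1 = d_h * h * (1 - h)      # sigmoid'(z1) = h * (1 - h)
    d_w1 = d_z1 * x               # hidden layer weight
    d_b1 = d_z1                   # hidden layer bias
    return {"w1": d_w1, "b1": d_b1, "w2": d_w2, "b2": d_b2}

# Arbitrary illustrative values
params = {"w1": 0.5, "b1": -0.1, "w2": 1.5, "b2": 0.2}
x, y = 2.0, 1.0

grads = backward(x, y, **params)

# Sanity check: nudge one parameter at a time and measure the change in loss
eps = 1e-6
loss_base = forward(x, y, **params)[3]
for name in params:
    bumped = dict(params, **{name: params[name] + eps})
    numeric = (forward(x, y, **bumped)[3] - loss_base) / eps
    print(f"{name}: backprop = {grads[name]:.6f}, numeric = {numeric:.6f}")
```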

Training terminology

Term	Meaning
Epoch	One complete pass through the entire training dataset
Batch size	Number of samples processed before the weights update once
Iteration	One weight update (one batch processed); iterations per epoch = Total samples ÷ Batch size
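The arithmetic linking these three terms fits in a short helper. This sketch assumes the final, smaller batch is still processed when the batch size does not divide the dataset evenly, which is why it rounds up; the function name and numbers are illustrative.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # Round up: a final, smaller batch still triggers one weight update
    return math.ceil(n_samples / batch_size)

n_samples, batch_size, epochs = 5000, 100, 10
per_epoch = iterations_per_epoch(n_samples, batch_size)

print("iterations per epoch:", per_epoch)
print("total weight updates:", per_epoch * epochs)
print("with batch size 64:", iterations_per_epoch(5000, 64))
```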

Python · the training loop concept (Keras)

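import numpy as np
from keras import Input, layers, models
from keras.optimizers import SGD

# Placeholder data so the example runs: 100 samples, 2 features, binary labels
X_train = np.random.rand(100, 2)
y_train = (X_train.sum(axis=1) > 1).astype("float32")

# A simple single-layer network: one sigmoid neuron on 2 inputs
model = models.Sequential([
    Input(shape=(2,)),
    layers.Dense(1, activation='sigmoid'),
])

# loss measures error; SGD applies the gradient-descent update,
# using gradients that Keras computes by backpropagation
model.compile(optimizer=SGD(learning_rate=0.01),
              loss='mse', metrics=['mae'])

# epochs = full passes over the data
model.fit(X_train, y_train, epochs=10, batch_size=32)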

💡 Tip: the four-step DL workflow. A typical neural-network program follows: (1) Prepare data → (2) Define architecture (layers) → (3) Compile (choose loss + optimizer) → (4) Train (in each epoch, the fit loop runs forward pass → backprop → weight update once per batch).

? Practice Questions

Perceptron arithmetic and the learning process are common exam material.

MCQ · Q1 · Components

In a perceptron, what do the weights represent?

A The final output
B The importance of each input
C The number of layers
D The learning rate

Answer: B

Each weight scales how strongly its input influences the decision — exactly like synapse strength in a biological neuron. Weights are what the network learns.

MCQ · Q2 · XOR

A single-layer perceptron cannot solve the XOR problem because:

A XOR has too much data
B XOR is not linearly separable — no single straight line splits its classes
C XOR needs a bias term
D XOR has three inputs

Answer: B

A single-layer perceptron is a linear classifier (one straight line). XOR's classes cannot be separated by any single line, so hidden layers are required.

MCQ · Q3 · Bias

Why is a bias term necessary in a neuron?

A It speeds up training
B It lets the activation shift, so the model can fit data not passing through the origin
C It removes the need for weights
D It is the model's output

Answer: B

Without a bias the decision boundary is locked through the origin. The bias is an offset that shifts the activation left/right, giving the model the flexibility to fit real data.

MCQ · Q4 · Gradient descent

What happens during training if the learning rate is set to 0?

A The model trains extremely fast
B The model never learns — weights never update
C The loss becomes negative
D The model overfits instantly

Answer: B

New weight = old weight − (learning rate × gradient). If the learning rate is 0, the step size is 0, so weights never change and no learning occurs.

MCQ · Q5 · Backpropagation

Backpropagation uses which mathematical rule to compute gradients?

A The Pythagorean theorem
B The chain rule of calculus
C Bayes' theorem
D The quadratic formula

Answer: B

Backpropagation applies the chain rule layer by layer, from the output backward to the input, to find how each weight affects the final loss.

MCQ · Q6 · Terminology

One complete pass through the entire training dataset is called:

A A batch
B An iteration
C An epoch
D A gradient

Answer: C

An epoch = one full pass over all training data. A batch is a subset processed before one update; an iteration is one update.

MCQ · Q7 · Learning rate

A learning rate that is far too large will most likely cause the loss to:

A Decrease smoothly and quickly
B Oscillate wildly or diverge (overshoot the minimum)
C Stay exactly constant
D Become a probability

Answer: B

Too-large steps overshoot the valley of the loss landscape, so the loss bounces around or even grows. Too-small steps make training painfully slow.

Numerical · Q8 · Forward pass

A perceptron has inputs x₁=3, x₂=5; weights w₁=0.4, w₂=0.2; bias b=−2. Compute z and the step output (1 if z ≥ 0).

Answer: z = 0.2, output = 1

z = (0.4×3) + (0.2×5) + (−2) = 1.2 + 1.0 − 2.0 = 0.2. Since 0.2 ≥ 0, the step activation outputs 1.

Numerical · Q9 · Iterations

A dataset has 5000 samples and the batch size is 100. How many iterations make up one epoch?

Answer: 50

Iterations per epoch = total samples ÷ batch size = 5000 ÷ 100 = 50. The weights are updated 50 times in one epoch.

Coding · Q10 · Perceptron

Write a Python function that implements a perceptron with a step activation, given a list of inputs, weights, and a bias.

Solution

Python

def perceptron(inputs, weights, bias):
    # Step A & B: weighted sum
    z = bias
    for x, w in zip(inputs, weights):
        z += x * w
    # Step C: step activation
    return 1 if z >= 0 else 0

print(perceptron([3, 5], [0.4, 0.2], -2))   # z = 0.2 -> 1
print(perceptron([1, 1], [0.4, 0.2], -2))   # z = -1.4 -> 0

Output

1
0

Coding · Q11 · Sigmoid neuron

Modify a neuron to output a probability using the sigmoid activation instead of a hard step. Test with inputs [2, 3], weights [0.5, 0.5], bias 0.

Solution

Python

import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias
    return 1 / (1 + np.exp(-z))      # sigmoid activation

prob = neuron([2, 3], [0.5, 0.5], 0)
print("Probability:", round(prob, 4))
print("Class:", 1 if prob >= 0.5 else 0)

Output

Probability: 0.9241
Class: 1

z = 0.5·2 + 0.5·3 + 0 = 2.5; σ(2.5) ≈ 0.924, so the predicted class is 1.

Short Answer · Q12 · Concept

In one or two sentences, explain the difference between the forward pass and backpropagation.

Model answer

The forward pass sends data from inputs through the network to produce a prediction and compute the loss. Backpropagation then goes backward, using the chain rule to calculate how much each weight contributed to that loss (its gradient), so gradient descent can update the weights to reduce error.

🎯 Lecture 7 — must-remember Neuron: z = Σ(w·x) + b → activation. Perceptron = linear classifier; cannot solve XOR (not linearly separable). Bias shifts the boundary off the origin. Gradient descent: new w = old w − (learning rate × gradient). Backprop = chain rule, distributes "blame" backward. Epoch = full data pass; iteration = one update.
