GenAI Exam Prep
⚡ LECTURE 5

Machine Learning Model Evaluation

A "90% accurate" model can still lose money. Learn how to truly judge a model — confusion matrices, precision/recall, F1, regression metrics, the bias-variance tradeoff and cross-validation.

Syllabus topics 15–19 · ⏱ ~28 min read · 12 practice questions

5.1 Classification Metrics & the Confusion Matrix

⚠️ The Accuracy Trap In a dataset with 99% healthy and 1% sick patients, a "lazy" model that predicts "Healthy" for everyone scores 99% accuracy — yet it is useless: it never catches a single sick patient. Accuracy is misleading on imbalanced datasets.

The Confusion Matrix

A 2×2 table comparing predictions against reality. Memorise these four cells:

| Predicted Positive | Predicted Negative
Actual Positive | TP: True Positive ✅ | FN: False Negative ❌ (Type II error)
Actual Negative | FP: False Positive ❌ (Type I error) | TN: True Negative ✅
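One practical caveat when moving from this table to code: scikit-learn's `confusion_matrix` orders rows and columns by sorted label, so with labels 0 and 1 the array reads [[TN, FP], [FN, TP]] rather than the positive-first layout used here. A minimal sketch, with made-up labels chosen only for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Default layout: rows = actual, columns = predicted, negative class first.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")

# Passing labels=[1, 0] puts the positive class first, matching the table above.
cm_positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_positive_first)  # [[TP FN], [FP TN]]
```

Unpacking with `ravel()` is a convenient way to avoid misreading the cells by position.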

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Use it ONLY when classes are roughly balanced (e.g. 50/50).
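The accuracy trap is easy to reproduce. The sketch below assumes 100 patients (99 healthy, 1 sick) and a "lazy" model that predicts healthy for everyone; it also prints recall, TP / (TP + FN), to show what accuracy hides.

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 healthy (0) patients and 1 sick (1) patient.
y_true = [0] * 99 + [1]
# The "lazy" model predicts healthy for everyone.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # TP / (TP + FN) = 0 / 1

print(f"Accuracy = {accuracy:.2f}")
print(f"Recall   = {recall:.2f}")
```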

5.2 Precision, Recall & F1 Score

Precision — "of what I flagged, how much was right?"

Precision = TP / (TP + FP)
Focus: minimising False Positives.

Use when false positives are costly. Example: spam detection — you must not send an important email to the spam folder, so high precision matters.

Recall (Sensitivity) — "of all real positives, how many did I catch?"

Recall = TP / (TP + FN)
Focus: minimising False Negatives.

Use when false negatives are dangerous. Example: cancer diagnosis — you cannot afford to miss a sick patient, so high recall is critical.

F1 Score — the balance

F1 Score — the harmonic mean of precision and recall. It is the best single metric for imbalanced datasets when you care about both FP and FN.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean punishes extremes: if precision OR recall is 0, F1 is 0.
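To see why the harmonic mean "punishes extremes", it helps to compare it with the ordinary arithmetic mean. The helper `f1_from` and the precision/recall pairs below are hypothetical, picked only to illustrate the effect.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs: balanced, lopsided, and one at zero.
for p, r in [(0.9, 0.9), (1.0, 0.1), (1.0, 0.0)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic mean={arithmetic:.2f}  F1={f1_from(p, r):.2f}")
```

For the lopsided pair the arithmetic mean still looks respectable, while F1 stays close to the weaker of the two numbers.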
🔑 The Precision-Recall Tradeoff Raising the decision threshold makes the model conservative → it flags fewer cases as positive, but more of those flags are correct → higher precision, lower recall. Lowering the threshold makes it liberal → it catches nearly every real positive along with more noise → higher recall, lower precision. You usually cannot maximise both at once.
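The tradeoff can be sketched by sweeping the threshold over a set of predicted probabilities. The labels, probabilities and threshold values below are invented for illustration; a real model would supply the probabilities (for example via `predict_proba`).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])

results = {}
for threshold in (0.3, 0.5, 0.9):
    # Predict positive only when the probability reaches the threshold.
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

With these made-up numbers, precision rises and recall falls as the threshold goes up. Real data will not always move this smoothly.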
🧩 The Diabetes Paradox (from the worksheet) 10 patients, 1 actually sick. A lazy model predicts "Healthy" for all but accidentally flags one healthy person. Result: TP=0, FN=1, FP=1, TN=8. Accuracy = 8/10 = 80% (looks decent). But Recall = TP/(TP+FN) = 0/1 = 0% — it caught zero sick patients. Precision = TP/(TP+FP) = 0/1 = 0%. The 80% accuracy completely hid a catastrophic failure.
Python · classification metrics
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, accuracy_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", round(precision_score(y_true, y_pred), 3))
print("Recall   :", round(recall_score(y_true, y_pred), 3))
print("F1 Score :", round(f1_score(y_true, y_pred), 3))
Output
Confusion matrix:
 [[3 1]
 [1 3]]
Accuracy : 0.75
Precision: 0.75
Recall   : 0.75
F1 Score : 0.75
💡 Tip — which metric when? Spam filter → Precision (avoid false alarms). Cancer / fraud detection → Recall (never miss a real case). Imbalanced data, both errors matter → F1. Balanced classes → Accuracy is fine.

5.3 Regression Metrics

For regression, error is the residual = actual − predicted. The metrics summarise these residuals.

MAE — Mean Absolute Error

MAE = (1/n) Σ |y − ŷ|
Average size of errors. Same units as y. Robust to outliers.

Pro: easy to interpret, robust to outliers. Con: treats a big mistake linearly — does not punish large errors extra.

MSE — Mean Squared Error

MSE = (1/n) Σ (y − ŷ)²
Squares errors → punishes large mistakes heavily. Units are squared (hard to interpret).

RMSE — Root Mean Squared Error

RMSE = √MSE
Brings the penalty back to the original units. RMSE is always ≥ MAE.

RMSE keeps MSE's heavy penalty for big errors but is interpretable like MAE.

R² — Coefficient of Determination

R² (R-squared): the proportion of variance in the target that the model explains. R² = 0.80 means the model explains 80% of the variability. Values usually fall between 0 and 1, and 1.0 is perfect; R² can be negative when the model does worse than always predicting the mean.
R² = 1 − (RSS / TSS)
RSS = Σ(y − ŷ)² (model error); TSS = Σ(y − ȳ)² (error if you just guessed the mean).
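The formula can be checked by hand against scikit-learn's `r2_score`. The values below are hypothetical and chosen to keep the arithmetic simple; the last lines show that a model worse than "always guess the mean" gets a negative R².

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

rss = np.sum((y_true - y_pred) ** 2)         # squared error of the model
tss = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always guessing the mean
r2_manual = 1 - rss / tss

print(f"RSS = {rss}, TSS = {tss}")
print(f"R2 (manual)  = {r2_manual:.3f}")
print(f"R2 (sklearn) = {r2_score(y_true, y_pred):.3f}")

# Predictions in reverse order are worse than guessing the mean, so R2 < 0.
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)
print(f"R2 (bad model) = {r2_bad:.1f}")
```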

Adjusted R²

⚠️ Why standard R² is misleading Plain R² never decreases when you add more features, even useless, random ones, so it rewards bloated models. Adjusted R² adds a penalty for the number of predictors. If you add a feature and Adjusted R² drops, that feature is not useful. Use Adjusted R² when comparing models with different feature counts.
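The penalty is usually written as Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with n samples and p predictors. That is the standard textbook definition rather than something stated in these notes. The sketch below is a rough illustration on synthetic data: the `adjusted_r2` helper, the seed and the data are all assumptions, and the regression is scored on its own training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

rng = np.random.default_rng(0)
n = 60
X_useful = rng.normal(size=(n, 2))
y = 3 * X_useful[:, 0] - 2 * X_useful[:, 1] + rng.normal(size=n)

# Append 10 columns of pure noise that have no relationship with y.
X_noisy = np.hstack([X_useful, rng.normal(size=(n, 10))])

# .score() returns plain R2, here measured on the training data itself.
r2_base = LinearRegression().fit(X_useful, y).score(X_useful, y)
r2_noisy = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

print(f"2 features : R2={r2_base:.4f}  adj R2={adjusted_r2(r2_base, n, 2):.4f}")
print(f"12 features: R2={r2_noisy:.4f}  adj R2={adjusted_r2(r2_noisy, n, 12):.4f}")
```

Plain R² on the training data cannot go down when columns are added. Whether Adjusted R² actually falls depends on the sample, so no particular output is claimed here.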
🧩 Worked example — the Darts Game (5 houses) Errors: 10, −20, 20, 10, 100.  |errors|: 10,20,20,10,100 (sum 160). errors²: 100,400,400,100,10000 (sum 11000).
MAE = 160/5 = 32  ·  MSE = 11000/5 = 2200  ·  RMSE = √2200 ≈ 46.9.
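The outlier house (error 100) added 100 to the absolute-error total but 10,000 to the squared-error total: it makes up 62.5% of the MAE sum (100 of 160) but about 91% of the MSE sum (10,000 of 11,000). That is why MSE punishes large errors heavily, while MAE is far less sensitive to outliers (it does not ignore them).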
Python · regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_true, y_pred)

print(f"MAE  = {mae}")
print(f"MSE  = {mse}")
print(f"RMSE = {rmse:.1f}")
print(f"R2   = {r2:.2f}")
Output
MAE  = 32.0
MSE  = 2200.0
RMSE = 46.9
R2   = 0.89

5.4 Bias & Variance

🎯 The Archer analogy High Bias = the archer is consistent but consistently wrong — all arrows tightly grouped, but far from the bullseye. High Variance = the arrows average out near the centre, but are scattered everywhere — wildly inconsistent.

High Bias → Underfitting

Bias — error from a model being too simple to capture the real pattern (e.g. using a straight line for curved data). The model underfits.

Signs: poor performance on both training and test data. Causes: model too simple, or missing useful features.

High Variance → Overfitting

Variance — error from a model being too complex, memorising noise in the training data instead of the real pattern. The model overfits.

Signs: excellent on training data but poor on test data (a big gap). Causes: model too complex/flexible, or too little training data.

5.5 The Bias-Variance Tradeoff

ModelBiasVarianceResult
Too simpleHighLowUnderfitting — oversimplifies reality
Too complexLowHighOverfitting — memorises noise
Just rightBalancedBalancedThe "sweet spot" — good generalisation

You cannot minimise both to zero at once — reducing bias tends to raise variance and vice versa. The goal is the sweet spot that minimises total error on unseen data.

⚠️ Diagnose it instantly Training accuracy 100%, test accuracy 60% → high variance (overfitting). Training accuracy 65%, test accuracy 63% (both low) → high bias (underfitting).
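The same diagnosis can be run in code by comparing training and test scores for a very simple and a very flexible model. The sketch below is one way to set that up, assuming synthetic noisy data and decision trees as the model family; none of these choices come from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 assigns a random label
# to about 20% of the samples.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for name, depth in [("shallow (max_depth=1)", 1), ("unrestricted depth", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each model.
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name:22s} train={scores[name][0]:.2f}  test={scores[name][1]:.2f}")
```

An unrestricted tree can memorise every training point, so a large train/test gap there points to high variance. Low scores on both sides for the one-split tree would point to high bias.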

Fixing overfitting

Get more data · data augmentation · reduce model complexity · early stopping · regularisation.

5.6 Cross-Validation

🔄 The Rotation Policy A single train/test split might be lucky or unlucky. Instead of always testing on the same chapter, rotate the chapters so you are tested on the whole book. That is cross-validation.

K-Fold Cross-Validation

Split the data into K equal folds. Train on K−1 folds, test on the remaining 1. Repeat K times so every fold is used as the test set once. Average the K scores for a reliable estimate.

🔑 Counting question — exam favourite In K-fold cross-validation, the model is trained and tested K times. So in 10-fold cross-validation, the model is trained & tested 10 times.
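Writing the loop out by hand makes the count visible. This is a sketch on synthetic data with an assumed K of 5; it fits a fresh model per fold and tracks how often each sample lands in a test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used only so the sketch is runnable.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
times_tested = np.zeros(len(X), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    model = LogisticRegression()              # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])     # train on the other K-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold
    times_tested[test_idx] += 1
    print(f"Fold {fold}: train={len(train_idx)} test={len(test_idx)}")

print("Fits performed:", len(scores))
print("Every sample tested exactly once:", bool((times_tested == 1).all()))
print("Average score :", round(float(np.mean(scores)), 3))
```

Note that for classifiers `cross_val_score(..., cv=5)` uses stratified folds by default, so its splits can differ from the plain `KFold` used here.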
Python · K-Fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# X (feature matrix) and y (labels) are assumed to be loaded already
model = LogisticRegression()
# cv=5  ->  5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print("Scores per fold:", scores.round(3))
print("Average score  :", scores.mean().round(3))
Output (illustrative; actual scores depend on X and y)
Scores per fold: [0.92 0.89 0.94 0.9 0.91]
Average score : 0.912
💡 Tip — holdout set Beyond train/test, a final holdout set acts as a "vault" — never touched during model building or hyperparameter tuning — so your final reported score is honest and free of data leakage.
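One possible way to set up such a "vault" is sketched below. The synthetic data, the 80/20 split and the candidate values of `C` are assumptions; the key point is that the holdout rows are scored once, after all tuning is done.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used only so the sketch is runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Lock 20% away in the "vault" before any modelling decisions are made.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hyperparameter tuning uses cross-validation on the development part only.
best_c, best_cv = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    cv_score = cross_val_score(LogisticRegression(C=c), X_dev, y_dev, cv=5).mean()
    if cv_score > best_cv:
        best_c, best_cv = c, cv_score

# Open the vault once, at the very end, for the reported score.
final_model = LogisticRegression(C=best_c).fit(X_dev, y_dev)
holdout_score = final_model.score(X_holdout, y_holdout)
print(f"Chosen C={best_c}  CV score={best_cv:.3f}  holdout score={holdout_score:.3f}")
```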
Practice Questions

Evaluation metrics are heavily tested. Work through every question.

MCQ · Q1 · Accuracy trap

In a dataset with 99 healthy and 1 sick patient, a model predicts "healthy" for everyone. Its accuracy is:

  • A 1%
  • B 50%
  • C 99%
  • D 0%
Answer: C

99 of 100 predictions are correct → 99% accuracy. Yet recall = 0% — it catches no sick patient. This is the accuracy trap on imbalanced data.

MCQ · Q2 · Confusion matrix

A model predicts a patient is healthy, but the patient is actually sick. This is a:

  • A True Positive
  • B False Positive (Type I error)
  • C False Negative (Type II error)
  • D True Negative
Answer: C

Predicted negative (healthy) but actually positive (sick) = False Negative, also called a Type II error — a dangerous "miss" in medicine.

MCQ · Q3 · Recall

For cancer diagnosis, which metric is most critical?

  • A Precision
  • B Recall
  • C Accuracy
  • D Mean Squared Error
Answer: B

Missing a sick patient (false negative) is dangerous, so we maximise Recall = TP/(TP+FN) to catch as many real cases as possible.

MCQ · Q4 · F1

The F1 score is the:

  • A Arithmetic mean of precision and recall
  • B Harmonic mean of precision and recall
  • C Sum of TP and TN
  • D Square root of accuracy
Answer: B

F1 = 2·(P·R)/(P+R) is the harmonic mean. It punishes extremes — if either precision or recall is 0, F1 is 0.

MCQ · Q5 · Regression metrics

Which regression metric punishes large errors the most heavily?

  • A MAE
  • B MSE
  • C
  • D Accuracy
Answer: B

MSE squares every error, so a large error contributes enormously (an error of 100 becomes 10,000). MAE treats errors linearly.

MCQ · Q6 · Overfitting

A model scores 100% on training data but 60% on test data. It most likely suffers from:

  • A High bias (underfitting)
  • B High variance (overfitting)
  • C Low bias and low variance
  • D Perfect generalisation
Answer: B

A large gap between training (high) and test (low) accuracy is the signature of overfitting — the model memorised noise (high variance).

MCQ · Q7 · Adjusted R²

Why use Adjusted R² instead of plain R²?

  • A It is always larger
  • B It penalises adding useless extra features, unlike plain R²
  • C It works only for classification
  • D It ignores the training data
Answer: B

Plain R² never decreases when features are added (even random ones). Adjusted R² penalises the feature count, so it can drop if a feature is useless.

MCQ · Q8 · Cross-validation

In 10-fold cross-validation, the model is trained and tested:

  • A 1 time
  • B 5 times
  • C 10 times
  • D 100 times
Answer: C

K-fold runs K iterations — each fold is the test set exactly once. K=10 → trained & tested 10 times, then the scores are averaged.

Numerical · Q9 · Precision & Recall

A confusion matrix gives TP = 40, FP = 10, FN = 20, TN = 30. Compute Precision and Recall.

Answer: Precision = 0.8, Recall ≈ 0.667

Precision = TP/(TP+FP) = 40/(40+10) = 40/50 = 0.80.
Recall = TP/(TP+FN) = 40/(40+20) = 40/60 ≈ 0.667.
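The same arithmetic can be checked in a couple of lines. `precision_recall` is a hypothetical helper written for this check, not a library function.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the question: TP = 40, FP = 10, FN = 20 (TN is not needed).
precision, recall = precision_recall(tp=40, fp=10, fn=20)
print(f"Precision = {precision:.3f}")
print(f"Recall    = {recall:.3f}")
```

TN does not appear in either formula, which is why the 30 true negatives play no part in the answer.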

Numerical · Q10 · F1

If Precision = 0.6 and Recall = 0.6, what is the F1 score?

Answer: 0.6

F1 = 2·(P·R)/(P+R) = 2·(0.36)/(1.2) = 0.72/1.2 = 0.6. When precision = recall, F1 equals that same value.

Coding · Q11 · Metrics in code

Given y_true = [200,300,400] and y_pred = [210,290,440], write code to print MAE, MSE and RMSE.

Solution
Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [200, 300, 400]
y_pred = [210, 290, 440]

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", round(rmse, 2))
Output
MAE : 20.0
MSE : 600.0
RMSE: 24.49

Residuals (actual − predicted) are −10, 10, −40, so the absolute errors are 10, 10, 40. Their average is 60/3 = 20 (MAE). Squared errors are 100 + 100 + 1600 = 1800, and 1800/3 = 600 (MSE). RMSE = √600 ≈ 24.49.

Short Answer · Q12 · Bias-Variance

Explain the bias-variance tradeoff in your own words, and why we cannot eliminate both.

Model answer

Bias is error from a model that is too simple (it underfits — wrong on both training and test data). Variance is error from a model that is too complex (it overfits — memorises noise, great on training but poor on test). Making a model more complex lowers bias but raises variance, and simplifying it does the reverse — so reducing one tends to increase the other. The goal is the "sweet spot" that balances them to minimise total error on unseen data.

🎯 Lecture 5 — must-remember Confusion matrix: TP/TN/FP(Type I)/FN(Type II). Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1 = harmonic mean. Regression: MAE (robust), MSE (punishes big errors), RMSE = √MSE, R² = 1−RSS/TSS. Underfitting = high bias; overfitting = high variance. K-fold CV trains & tests K times.