Machine Learning Model Evaluation
A "90% accurate" model can still lose money. Learn how to truly judge a model — confusion matrices, precision/recall, F1, regression metrics, the bias-variance tradeoff and cross-validation.
5.1 Classification Metrics & the Confusion Matrix
The Confusion Matrix
A 2×2 table comparing predictions against reality. Memorise these four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP — True Positive ✅ | FN — False Negative ❌ (Type II error) |
| Actual Negative | FP — False Positive ❌ (Type I error) | TN — True Negative ✅ |
- TP — correctly predicted positive.
- TN — correctly predicted negative.
- FP (Type I error) — predicted positive but actually negative (a "false alarm").
- FN (Type II error) — predicted negative but actually positive (a "miss").
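The four cells can be counted by hand before reaching for a library. A minimal sketch in plain Python, reusing the eight labels from the scikit-learn example in 5.2 (1 = positive, 0 = negative); the variable names are illustrative:

```python
# Labels taken from the scikit-learn example in section 5.2
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted positive
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm (Type I error)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # miss (Type II error)

print(tp, fn, fp, tn)  # 3 1 1 3
```

Every prediction lands in exactly one cell, so the four counts always add up to the number of samples.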
Accuracy
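Accuracy is the share of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN). A small sketch, assuming the four counts are already known; the helper name `accuracy` and the counts passed to it are illustrative:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that match reality."""
    total = tp + tn + fp + fn
    # Guard against an empty confusion matrix
    return (tp + tn) / total if total else 0.0

print(accuracy(tp=3, tn=3, fp=1, fn=1))  # 0.75
```

Accuracy weighs every prediction equally, so on imbalanced data it mostly reflects performance on the majority class.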
5.2 Precision, Recall & F1 Score
Precision — "of what I flagged, how much was right?"
Precision = TP / (TP + FP).
Use when false positives are costly. Example: spam detection. Sending an important email to the spam folder is a false positive, so high precision matters.
Recall (Sensitivity) — "of all real positives, how many did I catch?"
Recall = TP / (TP + FN).
Use when false negatives are dangerous. Example: cancer diagnosis. You cannot afford to miss a sick patient, so high recall is critical.
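Both ratios can be computed directly from confusion-matrix counts. The sketch below is a minimal illustration; the helper names and the counts (TP = 30, FP = 10, FN = 30) are made up for the example, not taken from the lecture:

```python
def precision(tp, fp):
    """Of everything flagged positive, the fraction that really was positive."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0

def recall(tp, fn):
    """Of all real positives, the fraction the model caught."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical counts for illustration
tp, fp, fn = 30, 10, 30

print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.5
```

Here the model is right three times out of four when it raises a flag, yet it finds only half of the real positives. The two numbers answer different questions, which is why they are reported together.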
F1 Score — the balance
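F1 is the harmonic mean of precision and recall: F1 = 2 · (P · R) / (P + R). A short sketch comparing it with the plain average; the helper name `f1` and the input values are illustrative:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A lopsided model: very precise, but it misses most positives
p, r = 0.9, 0.1

print(round((p + r) / 2, 3))  # 0.5  (arithmetic mean hides the weak recall)
print(round(f1(p, r), 3))     # 0.18 (harmonic mean is pulled toward the lower value)
```

Because the harmonic mean sits close to the smaller of the two inputs, a model needs both precision and recall to be reasonable before its F1 looks good.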
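from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, accuracy_score)
# 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# scikit-learn puts actual classes in rows and predicted classes in columns,
# ordered by label, so the printed layout is [[TN, FP], [FN, TP]]
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
# Here TP = 3, TN = 3, FP = 1, FN = 1, so all four scores come out at 0.75
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", round(precision_score(y_true, y_pred), 3))
print("Recall :", round(recall_score(y_true, y_pred), 3))
print("F1 Score :", round(f1_score(y_true, y_pred), 3))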
5.3 Regression Metrics
For regression, error is the residual = actual − predicted. The metrics summarise these residuals.
MAE — Mean Absolute Error
Pro: easy to interpret (same units as the target) and less sensitive to outliers than MSE. Con: every error counts in proportion to its size, so large mistakes get no extra penalty.
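MAE is the average of the absolute residuals: MAE = (1/n) · Σ |actual − predicted|. A sketch in plain Python, using the five actual and predicted values from the scikit-learn example later in this section:

```python
y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]

# Residual = actual - predicted
residuals = [t - p for t, p in zip(y_true, y_pred)]

# Average of the absolute residuals
mae = sum(abs(r) for r in residuals) / len(residuals)

print(residuals)  # [10, -20, 20, 10, 100]
print(mae)        # 32.0
```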
MSE — Mean Squared Error
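MSE is the average of the squared residuals: MSE = (1/n) · Σ (actual − predicted)². Squaring lets a single large error dominate the total, which the sketch below tries to show on the five values from the scikit-learn example later in this section (the name `outlier_share` is illustrative):

```python
y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]

# Squared residual for each point
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# Fraction of the total squared error caused by the single largest miss
outlier_share = max(squared_errors) / sum(squared_errors)

print(squared_errors)           # [100, 400, 400, 100, 10000]
print(mse)                      # 2200.0
print(round(outlier_share, 2))  # 0.91
```

One point out of five accounts for roughly nine tenths of the squared error. Note that MSE is expressed in squared units of the target.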
RMSE — Root Mean Squared Error
RMSE keeps MSE's heavy penalty for big errors, but the square root brings it back to the target's units, so it is as easy to interpret as MAE.
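RMSE is simply the square root of MSE: RMSE = √MSE. A minimal sketch with the standard library only, on the five values from the scikit-learn example later in this section:

```python
import math

y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)  # back in the same units as y_true

print(mse)             # 2200.0
print(round(rmse, 1))  # 46.9
```

The RMSE of about 46.9 is larger than the MAE of 32 on the same data. RMSE can never be smaller than MAE, and a wide gap between them hints at a few large errors.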
R² — Coefficient of Determination
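R² compares the model's squared error with that of a baseline that always predicts the mean of the actual values: R² = 1 − SS_res / SS_tot, where SS_res = Σ (actual − predicted)² and SS_tot = Σ (actual − mean)². A value of 1 means a perfect fit and 0 means no better than the mean. A sketch in plain Python on the five values from the scikit-learn example later in this section:

```python
y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]

mean_true = sum(y_true) / len(y_true)

# Squared error of the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Squared error of the "always predict the mean" baseline
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot

print(mean_true)     # 240.0
print(ss_res)        # 11000
print(ss_tot)        # 97000.0
print(round(r2, 3))  # 0.887
```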
Adjusted R²
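Adjusted R² corrects R² for the number of features: Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1), where n is the number of samples and p the number of features. The sketch below uses a hypothetical R² of 0.80 on 50 samples to show how the penalty grows with p; the function name and the inputs are illustrative:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² for a model with n_features predictors fit on n_samples rows."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Same plain R², different feature counts (hypothetical numbers)
print(round(adjusted_r2(0.80, n_samples=50, n_features=5), 3))   # 0.777
print(round(adjusted_r2(0.80, n_samples=50, n_features=20), 3))  # 0.662
```

With the same plain R², the model that needed 20 features scores lower than the one that needed 5.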
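Worked example: five houses with actual values 100, 150, 200, 250, 500 and predicted values 90, 170, 180, 240, 400, so the residuals are 10, −20, 20, 10, 100.
MAE = 160/5 = 32 · MSE = 11000/5 = 2200 · RMSE = √2200 ≈ 46.9.
The outlier house (error 100) adds 100 to the sum of absolute errors but 10,000 to the sum of squared errors. That is why MSE punishes large errors heavily, while MAE is far less sensitive to outliers.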
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Actual vs predicted values; the last point is the outlier (error of 100)
y_true = [100, 150, 200, 250, 500]
y_pred = [90, 170, 180, 240, 400]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the same units as the target
r2 = r2_score(y_true, y_pred)
print(f"MAE = {mae}")        # MAE = 32.0
print(f"MSE = {mse}")        # MSE = 2200.0
print(f"RMSE = {rmse:.1f}")  # RMSE = 46.9
print(f"R2 = {r2:.2f}")      # R2 = 0.89
5.4 Bias & Variance
High Bias → Underfitting
Signs: poor performance on both training and test data. Causes: model too simple, or missing useful features.
High Variance → Overfitting
Signs: excellent on training data but poor on test data (a big gap). Causes: model too complex/flexible, or too little training data.
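The two sets of signs above can be turned into a rough check on a pair of scores. The sketch below is a heuristic only: the function name `diagnose` and the thresholds `low` and `gap` are hypothetical choices, not standard values, and sensible cut-offs depend on the problem.

```python
def diagnose(train_score, test_score, low=0.7, gap=0.1):
    """Rough bias/variance check from training and test scores (higher = better).

    `low` and `gap` are hypothetical thresholds chosen for illustration.
    """
    # Poor on both training and test data -> the model is too simple
    if train_score < low and test_score < low:
        return "high bias (underfitting)"
    # Much better on training than on test data -> the model memorised noise
    if train_score - test_score > gap:
        return "high variance (overfitting)"
    return "no obvious problem"

print(diagnose(0.62, 0.60))  # high bias (underfitting)
print(diagnose(1.00, 0.60))  # high variance (overfitting)
print(diagnose(0.91, 0.89))  # no obvious problem
```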
5.5 The Bias-Variance Tradeoff
| Model | Bias | Variance | Result |
|---|---|---|---|
| Too simple | High | Low | Underfitting — oversimplifies reality |
| Too complex | Low | High | Overfitting — memorises noise |
| Just right | Balanced | Balanced | The "sweet spot" — good generalisation |
You cannot minimise both to zero at once — reducing bias tends to raise variance and vice versa. The goal is the sweet spot that minimises total error on unseen data.
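One way to see the tradeoff is to fit models of increasing flexibility to the same noisy data and compare training and test error. The sketch below assumes NumPy, synthetic data drawn from a sine curve, and polynomial degree as the complexity knob; none of these choices come from the lecture.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine curve (synthetic data, for illustration only)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same process

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree
    poly = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))
    train_err, test_err = results[degree]
    print(f"degree {degree}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

Training error can only fall as the degree grows, because each higher-degree polynomial contains the lower-degree ones. Test error typically falls and then rises again, and the degree where it bottoms out is the sweet spot. The exact numbers depend on the random seed.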
Fixing overfitting
Get more data · data augmentation · reduce model complexity · early stopping · regularisation.
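As an illustration of the last item, the sketch below applies L2 (ridge) regularisation to a linear model using NumPy's closed-form solution. Ridge is only one kind of regularisation, and the data, the helper name `ridge_fit` and the `alpha` values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first two actually matter
X = rng.normal(size=(30, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = {}
for alpha in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, alpha)
    norms[alpha] = float(np.linalg.norm(w))
    print(f"alpha = {alpha:>5}: weight norm = {norms[alpha]:.3f}")
```

A larger `alpha` shrinks the weights toward zero, which makes the model less flexible: variance goes down at the cost of some added bias.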
5.6 Cross-Validation
K-Fold Cross-Validation
Split the data into K folds of (roughly) equal size. Train on K−1 folds and test on the remaining one. Repeat K times so that every fold serves as the test set exactly once, then average the K scores for a more reliable estimate than a single train/test split.
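The splitting procedure described above can be sketched in plain Python. The helper `k_fold_indices` is hypothetical and keeps the folds contiguous without shuffling; library splitters usually offer shuffling and stratification as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation.

    Folds are contiguous and unshuffled; the first n_samples % k folds
    get one extra sample when the data does not divide evenly.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current fold is training data
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "| train size:", len(train_idx))
```

With 10 samples and K = 5, each iteration tests on 2 samples and trains on the other 8, and every sample appears in a test fold exactly once.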
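from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Synthetic data so the snippet runs on its own (an assumption; substitute your own X and y)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)
# cv=5 -> 5-fold cross-validation; the default score for a classifier is accuracy
scores = cross_val_score(model, X, y, cv=5)
print("Scores per fold:", scores.round(3))
print("Average score :", scores.mean().round(3))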
Evaluation metrics are heavily tested. Work through every question.
In a dataset with 99 healthy and 1 sick patient, a model predicts "healthy" for everyone. Its accuracy is:
99 of 100 predictions are correct → 99% accuracy. Yet recall = 0% — it catches no sick patient. This is the accuracy trap on imbalanced data.
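The arithmetic can be checked in a few lines of plain Python. The encoding (1 = sick, 0 = healthy) is an assumption made for the sketch:

```python
# 99 healthy patients (0) and 1 sick patient (1)
y_true = [0] * 99 + [1]
# The model predicts "healthy" for everyone
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # sick and caught
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # sick and missed
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0
```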
A model predicts a patient is healthy, but the patient is actually sick. This is a:
Predicted negative (healthy) but actually positive (sick) = False Negative, also called a Type II error — a dangerous "miss" in medicine.
For cancer diagnosis, which metric is most critical?
Missing a sick patient (false negative) is dangerous, so we maximise Recall = TP/(TP+FN) to catch as many real cases as possible.
The F1 score is the:
F1 = 2·(P·R)/(P+R) is the harmonic mean. It punishes extremes — if either precision or recall is 0, F1 is 0.
Which regression metric punishes large errors the most heavily?
MSE squares every error, so a large error contributes enormously (an error of 100 becomes 10,000). MAE treats errors linearly.
A model scores 100% on training data but 60% on test data. It most likely suffers from:
A large gap between training (high) and test (low) accuracy is the signature of overfitting — the model memorised noise (high variance).
Why use Adjusted R² instead of plain R²?
Plain R² never decreases when features are added (even random ones). Adjusted R² penalises the feature count, so it can drop if a feature is useless.
In 10-fold cross-validation, the model is trained and tested:
K-fold runs K iterations — each fold is the test set exactly once. K=10 → trained & tested 10 times, then the scores are averaged.
A confusion matrix gives TP = 40, FP = 10, FN = 20, TN = 30. Compute Precision and Recall.
Precision = TP/(TP+FP) = 40/(40+10) = 40/50 = 0.80.
Recall = TP/(TP+FN) = 40/(40+20) = 40/60 ≈ 0.667.
If Precision = 0.6 and Recall = 0.6, what is the F1 score?
F1 = 2·(P·R)/(P+R) = 2·(0.36)/(1.2) = 0.72/1.2 = 0.6. When precision = recall, F1 equals that same value.
Given y_true = [200,300,400] and y_pred = [210,290,440], write code to print MAE, MSE and RMSE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_true = [200, 300, 400]
y_pred = [210, 290, 440]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", round(rmse, 2))
The residuals (actual − predicted) are −10, 10, −40, so the absolute errors are 10, 10, 40. Their average is 60/3 = 20 (MAE). The squared errors are 100 + 100 + 1600 = 1800, and 1800/3 = 600 (MSE). RMSE = √600 ≈ 24.49.
Explain the bias-variance tradeoff in your own words, and why we cannot eliminate both.
Bias is error from a model that is too simple (it underfits — wrong on both training and test data). Variance is error from a model that is too complex (it overfits — memorises noise, great on training but poor on test). Making a model more complex lowers bias but raises variance, and simplifying it does the reverse — so reducing one tends to increase the other. The goal is the "sweet spot" that balances them to minimise total error on unseen data.