Supervised Learning — Linear Regression
The first and most fundamental supervised model. Learn how a straight line can predict numbers, how it learns the "best" line, the assumptions it relies on, and where it breaks.
3.1 What Linear Regression is
Linear Regression solves a regression task: the output is a continuous number (house price, salary, temperature, sales). It is usually the first supervised model taught because it is:
- Simple — conceptually easy to grasp.
- Interpretable — the slope directly tells you the effect of each feature.
- Fast — computationally efficient, even on large datasets.
- Effective — performs well when the relationship really is roughly linear.
Two types
- Simple Linear Regression: one input variable.
  Revenue = f(loaves)
- Multiple Linear Regression: many inputs.
  Revenue = f(loaves, advertising, weekday, discount)
The core idea is identical: fit a line/plane that best predicts the output.
3.2 Model Formulation
The simple linear regression equation:
y = β0 + β1·x + ε
| Term | Name | Meaning |
|---|---|---|
| y | Target / Output | What we predict — e.g. revenue |
| x | Input / Predictor | The feature — e.g. loaves baked |
| β0 | Intercept (bias) | Baseline value of y when x = 0 |
| β1 | Slope (weight) | Change in y for a one-unit change in x |
| ε | Error term | Random variation the model cannot explain |
You may also see it written y = mx + c where m is the slope and c the intercept — identical idea.
Example: in Price = 0.0625 × Size − 8.75, the slope 0.0625 means each extra square foot adds 0.0625 lakhs to the predicted price. A positive slope indicates a positive correlation between x and y; a negative slope indicates a negative correlation.
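As a minimal sketch of the equation in action, the snippet below hard-codes the example coefficients (β0 = −8.75, β1 = 0.0625) and checks that one extra square foot moves the prediction by exactly the slope. The helper name `predict_price` is made up for illustration and is not part of any library.

```python
# Coefficients taken from the example line: Price = 0.0625 * Size - 8.75
BETA_0 = -8.75    # intercept: predicted price when Size = 0
BETA_1 = 0.0625   # slope: change in price (lakhs) per extra square foot


def predict_price(size_sqft: float) -> float:
    """Deterministic part of the model: y = beta_0 + beta_1 * x (no error term)."""
    return BETA_0 + BETA_1 * size_sqft


base = predict_price(1000)
one_more = predict_price(1001)
print("Price at 1000 sq ft:", base)
print("Effect of one extra sq ft:", one_more - base)
```

The error term ε is deliberately left out: it describes what the line cannot explain, so it never appears in a prediction.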
3.3 How the model learns — Ordinary Least Squares
Many straight lines could pass through a cloud of points. Which one does the model pick? The one that makes the smallest overall mistakes.
Residuals (prediction errors)
A residual is the error of a single prediction: residual = actual value − predicted value. A positive residual means the prediction was too low; a negative residual means it was too high.
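The definition can be sketched in a few lines. This example assumes the five-house dataset used in the code later in this section and the example line Price = 0.0625 × Size − 8.75; the helper `predict` is illustrative only.

```python
# Five houses: size in sq ft and actual price in lakhs
sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 60, 65, 75, 110]


def predict(size):
    """Example line from the lecture: Price = 0.0625 * Size - 8.75."""
    return 0.0625 * size - 8.75


# residual = actual - predicted, one per data point
residuals = [actual - predict(size) for size, actual in zip(sizes, prices)]

for size, actual, res in zip(sizes, prices, residuals):
    verdict = "too low" if res > 0 else "too high"
    print(f"Size {size}: actual={actual}, predicted={predict(size)}, "
          f"residual={res} (prediction {verdict})")
```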
Why square the errors?
If you simply add raw errors, positive and negative residuals cancel out — a bad line could look "good" by accident. So Linear Regression squares every error (removing the sign), adds them up, and chooses the line with the smallest total. This is Ordinary Least Squares (OLS).
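To see the cancellation problem concretely, the sketch below compares two candidate lines on the five-house dataset from this section: the example line Price = 0.0625 × Size − 8.75 and a deliberately poor flat line that always predicts the average price of 70. The flat line and the helper names are assumptions made for illustration.

```python
sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 60, 65, 75, 110]


def ols_line(size):
    return 0.0625 * size - 8.75


def flat_line(size):
    return 70.0  # ignores size entirely; always predicts the mean price


def raw_and_squared(predict):
    """Return (sum of raw residuals, sum of squared residuals) for a line."""
    residuals = [p - predict(s) for s, p in zip(sizes, prices)]
    return sum(residuals), sum(r ** 2 for r in residuals)


ols_raw, ols_sse = raw_and_squared(ols_line)
flat_raw, flat_sse = raw_and_squared(flat_line)

print("OLS line : raw sum =", ols_raw, " squared sum =", ols_sse)
print("Flat line: raw sum =", flat_raw, " squared sum =", flat_sse)
```

Both raw sums come out as zero, so raw errors cannot tell the two lines apart. Only the squared sum shows that the flat line is far worse.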
The closed-form OLS formulas
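m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
c = ȳ − m·x̄
In the slope formula, the numerator measures covariance (do x and y move together?) and the denominator measures the variance of x (its spread). Because c = ȳ − m·x̄, OLS forces the line through the average point (x̄, ȳ).
For the five-house dataset in the code below, these formulas give Price = 0.0625 × Size − 8.75. Predicting a 1400 sq ft house: 0.0625 × 1400 − 8.75 = 78.75 lakhs.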
Doing it in code — by hand then with scikit-learn
```python
import pandas as pd

# Five houses: size in sq ft, price in lakhs
df = pd.DataFrame({"Size": [800, 1000, 1200, 1500, 1800],
                   "Price": [40, 60, 65, 75, 110]})
x, y = df["Size"], df["Price"]
x_mean, y_mean = x.mean(), y.mean()

# slope and intercept from the OLS formulas
m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
c = y_mean - m * x_mean
print(f"Price = {m:.4f} x Size {c:+.2f}")  # {c:+.2f} prints the sign of c
print("Predict 1400 sq ft:", m * 1400 + c)
```
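```python
from sklearn.linear_model import LinearRegression

X = df[["Size"]]  # 2D (DataFrame) for sklearn; df is defined in the code above
y = df["Price"]
model = LinearRegression()
model.fit(X, y)
print("Slope (coef_):", model.coef_[0])
print("Intercept:", model.intercept_)
# Predict with a DataFrame so the feature name matches the one seen during fit
print("Predict 1400:", model.predict(pd.DataFrame({"Size": [1400]})))
```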
After .fit(), model.coef_ holds the slope(s) and model.intercept_ holds β0. For multiple regression, coef_ is an array with one weight per feature, in the same order as the columns of X.
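The "one weight per feature, in column order" idea can be sketched without scikit-learn. The example below uses NumPy's `np.linalg.lstsq` instead of `LinearRegression`, on made-up bakery data generated from an assumed noise-free rule (revenue = 5 + 2 × loaves + 3 × advertising). The data, the rule and the feature names are hypothetical.

```python
import numpy as np

feature_names = ["loaves", "advertising"]  # column order of X

# Hypothetical data: each row is one day, columns follow feature_names
X = np.array([[10, 1], [20, 0], [30, 2], [40, 1], [50, 3]], dtype=float)
y = 5 + 2 * X[:, 0] + 3 * X[:, 1]  # made-up rule with no noise

# Prepend a column of ones so the first fitted weight acts as the intercept
A = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coefs = solution[0], solution[1:]

print("Intercept:", round(float(intercept), 4))
for name, weight in zip(feature_names, coefs):
    print(f"{name}: {weight:.4f}")  # weights line up with the columns of X
```

Pairing the weights with the column names, as in the final loop, is a simple guard against reading the coefficients in the wrong order.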
Diagnostic plots
- Residuals vs Fitted plot — checks linearity & constant variance. Points should scatter randomly around 0.
- Q-Q plot — checks if residuals are normally distributed. Points should lie along the diagonal.
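The numbers behind both plots can be computed with the standard library alone. This sketch assumes the five-house dataset and the fitted line from this section, and it prints the points rather than drawing them. The plotting positions (i + 0.5) / n used for the theoretical quantiles are one common convention and are an assumption here.

```python
from statistics import NormalDist, stdev

sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 60, 65, 75, 110]

fitted = [0.0625 * s - 8.75 for s in sizes]
residuals = [p - f for p, f in zip(prices, fitted)]

# Residuals vs Fitted: these (fitted, residual) pairs are the points on that plot
for f, r in zip(fitted, residuals):
    print(f"fitted={f:7.2f}  residual={r:6.2f}")

# Q-Q plot: sorted standardized residuals against standard normal quantiles
n = len(residuals)
scale = stdev(residuals)  # the residuals already average to zero
sample_q = sorted(r / scale for r in residuals)
theory_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
for t, s in zip(theory_q, sample_q):
    print(f"theoretical={t:6.2f}  sample={s:6.2f}")
```

With only five points neither plot is conclusive. On a real dataset the same pairs would typically be passed to a plotting library.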
3.4 The Seven Assumptions
Linear Regression only works well if these assumptions are approximately true. Examiners frequently ask you to name or recognise them.
| # | Assumption | What it means / fix if violated |
|---|---|---|
| 1 | Linearity | The input–output relationship must be roughly a straight line. Curved residual plot = violated. Fix: polynomial terms, log transform, or a tree model. |
| 2 | Independence of Errors | Errors must not correlate with each other (critical for time-series). Fix: lag features or time-series models. |
| 3 | Homoscedasticity | Error variance is constant across all x. If errors "fan out", variance is not constant. Fix: log-transform y, or weighted least squares. |
| 4 | Zero Mean of Errors | Residuals should average to zero. Fix: always include an intercept term. |
| 5 | No Multicollinearity | Input features must not be highly correlated with each other. Fix: drop correlated features, PCA, or ridge regression. |
| 6 | Exogeneity | Predictors must not correlate with the error term, else coefficients are biased. Fix: add confounders or instrumental variables. |
| 7 | Normality of Errors | For small datasets, residuals should be approximately normally distributed. Fix: transformations or bootstrap methods. |
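As an illustration of assumption 1 and its polynomial fix, the sketch below fits a straight line and then a quadratic to made-up data that follows y = x² exactly. It uses NumPy's `np.polyfit` rather than scikit-learn, and the data is hypothetical.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = x ** 2  # hypothetical curved relationship

# Degree 1: an ordinary straight-line fit
slope, intercept = np.polyfit(x, y, 1)
line_resid = y - (slope * x + intercept)
print("Line residuals:     ", np.round(line_resid, 4))

# Degree 2: the same least-squares fit with an added x**2 term
quad_coefs = np.polyfit(x, y, 2)
quad_resid = y - np.polyval(quad_coefs, x)
print("Quadratic residuals:", np.round(quad_resid, 4))
```

The straight-line residuals are positive at both ends and negative in the middle, which is the curved pattern the table describes. Adding the squared term removes it.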
3.5 Limitations of Linear Regression
- Only models linear relationships — it fails badly on curved or complex non-linear patterns.
- Extremely sensitive to outliers — a single extreme point can drastically tilt the fitted line (because OLS squares errors, big errors dominate).
- Struggles with correlated predictors — multicollinearity makes coefficient estimates unstable and hard to interpret.
- Omitted variable bias — leaving out an important feature distorts the coefficients of the remaining ones.
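The outlier sensitivity listed above can be sketched by refitting the five-house dataset after adding one extreme point. The outlier (a 2000 sq ft house priced at 20 lakhs) is invented for illustration, and `fit_line` is a hypothetical helper that applies the closed-form OLS formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) from the closed-form OLS formulas."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 60, 65, 75, 110]

m_clean, c_clean = fit_line(sizes, prices)
# Hypothetical outlier: a very large house sold unusually cheaply
m_out, c_out = fit_line(sizes + [2000], prices + [20])

print(f"Without outlier: slope={m_clean:.4f}, intercept={c_clean:.2f}")
print(f"With outlier:    slope={m_out:.4f}, intercept={c_out:.2f}")
```

A single added point is enough to flatten the slope to a small fraction of its original value.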
Practice questions. Read every explanation; these mirror the exam style closely.
In y = β₀ + β₁x + ε, what does β₁ represent?
β₁ is the slope — the change in y per unit change in x. β₀ (the intercept) is the value of y when x = 0; ε is the random error term.
Why does Ordinary Least Squares square the errors instead of just adding them?
If you add raw residuals, a +10 and a −10 cancel — a terrible line could look perfect. Squaring removes the sign, so all errors contribute positively, and also penalises large errors more heavily.
"The variance of the residuals is constant across all values of x." This assumption is called:
Homoscedasticity = constant error variance. When errors "fan out" (variance grows with x) it is called heteroscedasticity, and the assumption is violated.
Linear Regression is a poor choice when:
When the relationship is curved or otherwise non-linear. Linear Regression fundamentally fits a straight line, so it fails on such patterns. It is fine, and even preferred, for linear relationships, when interpretability matters, and on large datasets.
Including both "Weight in kg" and "Weight in lbs" as features causes:
Multicollinearity. The two columns are perfectly correlated copies of each other (lbs is just kg multiplied by a constant). The model cannot decide how to split the effect of weight between them, so coefficients become unstable and uninterpretable.
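A quick way to spot this kind of duplicate feature is to check the correlation between the two columns. The sketch below uses made-up weights and a hand-written Pearson correlation; the helper `pearson` and the sample values are assumptions for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_mean = sum(a) / len(a)
    b_mean = sum(b) / len(b)
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


weight_kg = [50, 60, 70, 80, 90]                  # hypothetical values
weight_lbs = [kg * 2.20462 for kg in weight_kg]   # same information, new unit

r = pearson(weight_kg, weight_lbs)
print(f"Correlation between kg and lbs columns: {r:.6f}")
```

A correlation at or very near 1 between two features is a signal to drop one of them before fitting.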
Why is Linear Regression especially sensitive to outliers?
Because errors are squared, an outlier with a large residual contributes an enormous squared term, so the fitted line shifts dramatically to reduce that one error.
A model learned Price = 0.0625 × Size − 8.75. What is the predicted price for a 2000 sq ft house?
Price = 0.0625 × 2000 − 8.75 = 125 − 8.75 = 116.25 lakhs. Just substitute the size into the learned equation.
A model predicts a price of 66.25 for a house whose actual price is 65. What is the residual?
Residual = actual − predicted = 65 − 66.25 = −1.25. A negative residual means the model over-predicted.
Using scikit-learn, train a simple linear regression on X = [[1],[2],[3],[4],[5]] and y = [3,5,7,9,11], then print the slope, intercept, and the prediction for x = 8.
```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [3, 5, 7, 9, 11]  # the true rule is y = 2x + 1
model = LinearRegression()
model.fit(X, y)
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predict x=8:", model.predict([[8]]))
```
The model recovers the exact rule y = 2x + 1, so x = 8 → 2(8)+1 = 17.
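As a cross-check that does not depend on scikit-learn, the same answer can be reproduced with the closed-form OLS formulas from section 3.3. This is a minimal sketch with plain Python lists; the variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# slope = sum of (x - x_mean)(y - y_mean) divided by sum of (x - x_mean)^2
m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
c = y_mean - m * x_mean  # the line passes through (x_mean, y_mean)

prediction = m * 8 + c
print("Slope:", m)
print("Intercept:", c)
print("Predict x=8:", prediction)
```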
You have a DataFrame df with feature columns ['Hours','PrevScore'] and target 'Result'. Write code to split into train/test sets, train a multiple linear regression, and print the model's coefficients.
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df[['Hours', 'PrevScore']]  # multiple features
y = df['Result']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Coefficients:", model.coef_)  # one weight per feature
print("Intercept:", model.intercept_)
```
Each value in coef_ is the weight for one feature, in column order, for example Hours → 2.865, PrevScore → 1.021 (the actual values depend on the data in df).
Name any three assumptions of Linear Regression and briefly explain why each matters.
- Linearity: the model fits a straight line, so the true relationship must be roughly linear or predictions are systematically wrong.
- Homoscedasticity: error variance must be constant; if it grows with x, confidence in predictions becomes unreliable.
- No multicollinearity: input features must not be near-duplicates, otherwise the coefficients become unstable and uninterpretable.
(Other valid answers: independence of errors, normality of errors, zero-mean errors, exogeneity.)