⚡ LECTURE 3

Supervised Learning — Linear Regression

The first and most fundamental supervised model. Learn how a straight line can predict numbers, how it learns the "best" line, the assumptions it relies on, and where it breaks.

Syllabus topics 8–10 · ~24 min read · 11 practice questions

In this lecture

What Linear Regression is
Model Formulation
How the model learns — OLS
The Seven Assumptions
Limitations of Linear Regression
Practice Questions

3.1 What Linear Regression is

🍞 The bakery story: You run a bakery and each day you note the loaves baked and the money earned. After a month you see a trend: more loaves, more revenue. But how much more, exactly? Does each loaf add ₹20 or ₹50? Linear Regression draws the best-fitting straight line through your scatter of points so you can answer three questions: what the relationship is, how much revenue changes per extra loaf, and what revenue to expect tomorrow.

Linear Regression: a supervised, predictive model that estimates the relationship between a target variable y and one or more input features x by fitting a straight line (or, with several features, a flat plane) that minimises the sum of squared errors between predictions and actual values.

It is a regression task — the output is a continuous number (house price, salary, temperature, sales). It is usually the first supervised model taught because it is:

Simple — conceptually easy to grasp.
Interpretable — the slope directly tells you the effect of each feature.
Fast — computationally efficient, even on large datasets.
Effective — performs well when the relationship really is roughly linear.

Two types

Simple Linear Regression — one input variable. Revenue = f(loaves)
Multiple Linear Regression — many inputs. Revenue = f(loaves, advertising, weekday, discount)

The core idea is identical: fit a line/plane that best predicts the output.

3.2 Model Formulation

The simple linear regression equation:

y = β₀ + β₁x + ε

Term	Name	Meaning
y	Target / Output	What we predict — e.g. revenue
x	Input / Predictor	The feature — e.g. loaves baked
β₀	Intercept (bias)	Baseline value of y when x = 0
β₁	Slope (weight)	Change in y for a one-unit change in x
ε	Error term	Random variation the model cannot explain

You may also see it written y = mx + c where m is the slope and c the intercept — identical idea.
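The same equation extends to Multiple Linear Regression by adding one weight per feature. A sketch in LaTeX, where p (the number of features) and the indexed symbols x_j and β_j are notation assumed here rather than taken from the lecture:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

For the bakery, x_1 could be loaves, x_2 advertising, x_3 weekday and x_4 discount; each β_j is then the change in y for a one-unit change in x_j while the other features are held fixed.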

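💡 Tip: interpreting the slope. The slope is the headline result. If a house-price model learns Price = 0.0625 × Size − 8.75 (price in lakhs, size in sq ft), the slope 0.0625 means each extra square foot adds 0.0625 lakhs (₹6,250) to the predicted price. In simple linear regression the sign of the slope matches the sign of the correlation: positive slope, positive correlation; negative slope, negative correlation. The intercept −8.75 is only the value the line takes at Size = 0; it is not a realistic price, because no house in the data is anywhere near that size.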
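The interpretation of the slope can be checked numerically. The sketch below hard-codes the line from the tip; the function name predict_price is a hypothetical helper introduced only for illustration.

```python
# Learned line from the tip: Price = 0.0625 * Size - 8.75 (price in lakhs)
SLOPE = 0.0625
INTERCEPT = -8.75


def predict_price(size_sqft):
    """Predicted price in lakhs for a house of the given size (hypothetical helper)."""
    return SLOPE * size_sqft + INTERCEPT


# One extra square foot changes the prediction by exactly the slope.
one_unit_change = predict_price(1001) - predict_price(1000)

# 200 extra square feet change it by 200 * slope.
bigger_house_change = predict_price(1400) - predict_price(1200)

print("Change for +1 sq ft:", one_unit_change)
print("Change for +200 sq ft:", bigger_house_change)
```

The intercept drops out of both differences, which is why the slope alone describes the effect of size.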

3.3 How the model learns — Ordinary Least Squares

Many straight lines could pass through a cloud of points. Which one does the model pick? The one that makes the smallest overall mistakes.

Residuals (prediction errors)

Residual — for each data point, error = actual value − predicted value. A positive residual means the prediction was too low; negative means too high.
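As a rough illustration of this definition, the sketch below computes the residual for each of the five houses in the worksheet data used later in this lecture, taking the line Price = 0.0625 × Size − 8.75 as given.

```python
import numpy as np

# Worksheet data: house sizes (sq ft) and actual prices (lakhs)
sizes = np.array([800, 1000, 1200, 1500, 1800])
prices = np.array([40, 60, 65, 75, 110])

# Predictions from the line Price = 0.0625 * Size - 8.75
predicted = 0.0625 * sizes - 8.75

# Residual = actual value - predicted value
residuals = prices - predicted

for size, actual, pred, res in zip(sizes, prices, predicted, residuals):
    verdict = "prediction too low" if res > 0 else "prediction too high"
    print(f"{size} sq ft: actual {actual}, predicted {pred}, residual {res} ({verdict})")
```

Three of the five residuals are negative (over-predictions) and two are positive (under-predictions).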

Why square the errors?

If you simply add raw errors, positive and negative residuals cancel out — a bad line could look "good" by accident. So Linear Regression squares every error (removing the sign), adds them up, and chooses the line with the smallest total. This is Ordinary Least Squares (OLS).

Minimise Σ (y − ŷ)² = Σ (residual)²

OLS finds the line with the minimum Sum of Squared Errors (SSE).
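To see why the raw sum is a poor score, the sketch below compares two lines on the worksheet house data: a flat line at the mean price (which ignores size entirely) and the line Price = 0.0625 × Size − 8.75. The helper name error_totals is an assumption made for this example.

```python
import numpy as np

sizes = np.array([800, 1000, 1200, 1500, 1800])
prices = np.array([40, 60, 65, 75, 110])


def error_totals(slope, intercept):
    """Return (sum of raw residuals, sum of squared residuals) for a candidate line."""
    residuals = prices - (slope * sizes + intercept)
    return residuals.sum(), (residuals ** 2).sum()


# Candidate 1: flat line at the mean price of 70, which ignores size completely
flat_sum, flat_sse = error_totals(0.0, 70.0)

# Candidate 2: the line Price = 0.0625 * Size - 8.75
ols_sum, ols_sse = error_totals(0.0625, -8.75)

print("Flat line: raw sum =", flat_sum, " SSE =", flat_sse)
print("OLS line:  raw sum =", ols_sum, " SSE =", ols_sse)
```

Both lines have a raw residual sum of zero, so the raw sum cannot tell them apart. The squared total can: 2650 for the flat line against 181.25 for the fitted one.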

The closed-form OLS formulas

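m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
c = ȳ − m·x̄

The numerator is proportional to the covariance of x and y (do they move together?); the denominator is proportional to the variance of x (its spread). The common factor cancels, so m = cov(x, y) / var(x). Because c = ȳ − m·x̄, the fitted line always passes through the average point (x̄, ȳ).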
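For readers who want to see where these formulas come from, here is a short derivation sketch in LaTeX. The symbol S for the sum of squared errors and the index i over data points are notation assumed here; the lecture itself only states the result.

```latex
\begin{aligned}
S(m, c) &= \sum_i (y_i - m x_i - c)^2 \\[4pt]
\frac{\partial S}{\partial c} = -2 \sum_i (y_i - m x_i - c) = 0
  &\;\Longrightarrow\; c = \bar{y} - m \bar{x} \\[4pt]
\frac{\partial S}{\partial m} = -2 \sum_i x_i \,(y_i - m x_i - c) = 0
  &\;\Longrightarrow\; \sum_i x_i \bigl[(y_i - \bar{y}) - m (x_i - \bar{x})\bigr] = 0 \\[4pt]
\text{and since } \sum_i \bar{x}\,\bigl[(y_i - \bar{y}) - m (x_i - \bar{x})\bigr] = 0:
  &\quad m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\end{aligned}
```

Setting both partial derivatives to zero is what "minimise the sum of squared errors" means in practice; the first condition also says the residuals sum to zero.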

🧩 Worked example: house prices (from the worksheet). Data: sizes 800–1800 sq ft, prices 40–110 lakhs. Means: x̄ = 1260, ȳ = 70. Plugging into the formulas gives m = 0.0625 and c = −8.75, so the learned equation is Price = 0.0625 × Size − 8.75. Predicting a 1400 sq ft house: 0.0625 × 1400 − 8.75 = 78.75 lakhs.

Doing it in code — by hand then with scikit-learn

Python · OLS by hand

import pandas as pd

df = pd.DataFrame({"Size": [800,1000,1200,1500,1800],
                   "Price": [40,60,65,75,110]})
x, y = df["Size"], df["Price"]
x_mean, y_mean = x.mean(), y.mean()

# slope and intercept from the OLS formulas
m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
c = y_mean - m * x_mean
print(f"Price = {m:.4f} x Size + {c:.2f}")
print("Predict 1400 sq ft:", m * 1400 + c)

Output

Price = 0.0625 x Size + -8.75
Predict 1400 sq ft: 78.75

Python · same thing with scikit-learn

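from sklearn.linear_model import LinearRegression

# Reuses pd and df from the by-hand example above
X = df[["Size"]]       # 2D (a DataFrame), because sklearn expects a feature matrix
y = df["Price"]

model = LinearRegression()
model.fit(X, y)

print("Slope (coef_):", model.coef_[0])
print("Intercept:", model.intercept_)
# Predict with a DataFrame that has the same column name used in fit;
# passing a bare list here triggers sklearn's feature-name warning
print("Predict 1400:", model.predict(pd.DataFrame({"Size": [1400]})))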

Output

Slope (coef_): 0.0625
Intercept: -8.75
Predict 1400: [78.75]

💡 Tip: sklearn attributes to memorise. After .fit(), model.coef_ holds the slope(s) and model.intercept_ holds β₀. For multiple regression, coef_ is an array with one weight per feature, in the same order as the columns of X.

Diagnostic plots

Residuals vs Fitted plot — checks linearity & constant variance. Points should scatter randomly around 0.
Q-Q plot — checks if residuals are normally distributed. Points should lie along the diagonal.
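The coordinates behind both plots can be computed without any plotting library. The sketch below is one possible way to do it for the worksheet house data; the plotting positions (i − 0.5)/n and the standardisation by the sample standard deviation are common conventions assumed here, not something the lecture specifies.

```python
import numpy as np
from statistics import NormalDist

sizes = np.array([800, 1000, 1200, 1500, 1800])
prices = np.array([40, 60, 65, 75, 110])

fitted = 0.0625 * sizes - 8.75      # predictions from the worksheet line
residuals = prices - fitted

# Residuals vs Fitted: x-axis = fitted value, y-axis = residual
resid_vs_fitted = list(zip(fitted, residuals))

# Q-Q plot: sorted standardised residuals against theoretical normal quantiles
n = len(residuals)
standardised = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
probabilities = (np.arange(1, n + 1) - 0.5) / n          # assumed plotting positions
theoretical = np.array([NormalDist().inv_cdf(p) for p in probabilities])
qq_points = list(zip(theoretical, standardised))

print("Residuals vs Fitted points:", resid_vs_fitted)
print("Q-Q points:", qq_points)
```

Scattering resid_vs_fitted and qq_points with any plotting tool gives the two diagnostics. With only five houses the plots are too sparse to judge; they become informative on larger datasets.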

3.4 The Seven Assumptions

Linear Regression only works well if these assumptions are approximately true. Examiners frequently ask you to name or recognise them.

#	Assumption	What it means / fix if violated
1	Linearity	The input–output relationship must be roughly a straight line. Curved residual plot = violated. Fix: polynomial terms, log transform, or a tree model.
2	Independence of Errors	Errors must not correlate with each other (critical for time-series). Fix: lag features or time-series models.
3	Homoscedasticity	Error variance is constant across all x. If errors "fan out", variance is not constant. Fix: log-transform y, or weighted least squares.
4	Zero Mean of Errors	Residuals should average to zero. Fix: always include an intercept term.
5	No Multicollinearity	Input features must not be highly correlated with each other. Fix: drop correlated features, PCA, or ridge regression.
6	Exogeneity	Predictors must not correlate with the error term, else coefficients are biased. Fix: add confounders or instrumental variables.
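7	Normality of Errors	Residuals should be approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests on small datasets; the OLS line itself can still be computed without it. Fix: transformations or bootstrap methods.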

🔑 Memory hook: "L.I.N.E." The four most-tested assumptions are Linearity, Independence of errors, Normality of errors, and Equal variance (homoscedasticity). Adding no multicollinearity, zero-mean errors, and exogeneity gives seven in total.
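Assumption 5 (no multicollinearity) is one of the easier ones to screen for in code. The sketch below flags highly correlated feature pairs using a correlation matrix. The data values are made up for illustration, and the 0.9 cut-off is an assumed rule of thumb, not a fixed standard.

```python
import numpy as np

# Hypothetical feature columns; weight_lbs is just weight_kg in other units
features = {
    "weight_kg": np.array([55.0, 62.0, 70.0, 81.0, 90.0, 68.0]),
    "height_cm": np.array([160.0, 172.0, 168.0, 181.0, 175.0, 165.0]),
}
features["weight_lbs"] = features["weight_kg"] * 2.20462

names = list(features)
# np.corrcoef treats each row as one variable
corr = np.corrcoef(np.vstack([features[name] for name in names]))

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > THRESHOLD
]

print("Highly correlated pairs:", flagged)
```

Only the kg/lbs pair is flagged here; dropping one of the two columns is the simplest of the fixes listed in the table.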

3.5 Limitations of Linear Regression

Only models linear relationships — it fails badly on curved or complex non-linear patterns.
Extremely sensitive to outliers — a single extreme point can drastically tilt the fitted line (because OLS squares errors, big errors dominate).
Struggles with correlated predictors — multicollinearity makes coefficient estimates unstable and hard to interpret.
Omitted variable bias — leaving out an important feature distorts the coefficients of the remaining ones.
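The outlier point in the list above is easy to demonstrate. The sketch below refits the worksheet house data after adding one hypothetical mis-recorded sale (a 900 sq ft house at 300 lakhs); that extra point is invented purely for illustration, and ols_line is a helper name assumed here.

```python
import numpy as np


def ols_line(x, y):
    """Slope and intercept from the closed-form OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    c = y.mean() - m * x.mean()
    return m, c


sizes = [800, 1000, 1200, 1500, 1800]
prices = [40, 60, 65, 75, 110]

m_clean, c_clean = ols_line(sizes, prices)

# Same data plus one hypothetical outlier: a small house at a very high price
m_out, c_out = ols_line(sizes + [900], prices + [300])

print(f"Without outlier: Price = {m_clean:.4f} x Size + {c_clean:.2f}")
print(f"With outlier:    Price = {m_out:.4f} x Size + {c_out:.2f}")
```

A single extreme point is enough to turn the slope negative, so the refitted model would claim that bigger houses are cheaper.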

⚠️ Exam trap: "Linear Regression assumes a straight-line relationship." If the data follows a curve (for example, dosage data where the response rises and then falls), no straight line can fit it well. This is one reason non-linear models such as Decision Trees (Lecture 6) are needed.
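A small experiment makes the trap concrete. The sketch below builds made-up dosage data that rises to a peak and then falls, fits a straight line, and compares it with a fit that adds a squared term (the "polynomial terms" fix from the assumptions table, used here instead of a tree model). The data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical dosage data: response peaks at dose 5, then declines
dose = np.arange(0, 11)
response = 25 - (dose - 5) ** 2


def sse(y, y_hat):
    """Sum of squared errors between actual and predicted values."""
    return ((y - y_hat) ** 2).sum()


# Straight line: np.polyfit returns [slope, intercept] for deg=1
line = np.polyfit(dose, response, deg=1)
sse_line = sse(response, np.polyval(line, dose))

# Same least-squares idea with an added squared term (deg=2)
curve = np.polyfit(dose, response, deg=2)
sse_curve = sse(response, np.polyval(curve, dose))

print(f"Line slope: {line[0]:.4f}, SSE: {sse_line:.2f}")
print(f"Quadratic SSE: {sse_curve:.6f}")
```

The best straight line is essentially flat: it reports almost no relationship even though the response depends entirely on the dose. The quadratic fit captures the pattern with near-zero error.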

Practice Questions

Choose, check, and read every explanation — these mirror the exam style closely.

MCQ · Q1 · Formulation

In y = β₀ + β₁x + ε, what does β₁ represent?

A The value of y when x = 0
B The change in y for a one-unit increase in x
C The random error
D The total number of data points

Answer: B

β₁ is the slope — the change in y per unit change in x. β₀ (the intercept) is the value of y when x = 0; ε is the random error term.

MCQ · Q2 · OLS

Why does Ordinary Least Squares square the errors instead of just adding them?

A To make the maths run faster
B So positive and negative errors do not cancel each other out
C Because errors are always negative
D To convert the result into a percentage

Answer: B

If you add raw residuals, a +10 and a −10 cancel — a terrible line could look perfect. Squaring removes the sign, so all errors contribute positively, and also penalises large errors more heavily.

MCQ · Q3 · Assumptions

"The variance of the residuals is constant across all values of x." This assumption is called:

A Linearity
B Homoscedasticity
C Multicollinearity
D Exogeneity

Answer: B

Homoscedasticity = constant error variance. When errors "fan out" (variance grows with x) it is called heteroscedasticity, and the assumption is violated.

MCQ · Q4 · Limitations

Linear Regression is a poor choice when:

A The relationship between x and y is roughly a straight line
B The data has a strongly non-linear (curved) pattern
C You want an interpretable model
D The dataset is large

Answer: B

Linear Regression fundamentally fits a straight line; it fails on curved/non-linear patterns. It is fine and even preferred for linear relationships, interpretability and large datasets.

MCQ · Q5 · Multicollinearity

Including both "Weight in kg" and "Weight in lbs" as features causes:

A Heteroscedasticity
B Multicollinearity — making coefficient estimates unstable
C Underfitting
D Nothing — it improves accuracy

Answer: B

The two columns are perfectly correlated copies of each other (lbs ≈ 2.2046 × kg). The model cannot decide how to divide the effect between them, so coefficients become unstable and uninterpretable.

MCQ · Q6 · Outliers

Why is Linear Regression especially sensitive to outliers?

A OLS squares the errors, so a single huge error dominates the total and tilts the line
B It ignores all large values by design
C Outliers make the model run out of memory
D It is not sensitive to outliers at all

Answer: A

Because errors are squared, an outlier with a large residual contributes an enormous squared term, so the fitted line shifts dramatically to reduce that one error.

Numerical · Q7 · Prediction

A model learned Price = 0.0625 × Size − 8.75. What is the predicted price for a 2000 sq ft house?

Answer: 116.25

Price = 0.0625 × 2000 − 8.75 = 125 − 8.75 = 116.25 lakhs. Just substitute the size into the learned equation.

Numerical · Q8 · Residuals

A model predicts a price of 66.25 for a house whose actual price is 65. What is the residual?

Answer: −1.25

Residual = actual − predicted = 65 − 66.25 = −1.25. A negative residual means the model over-predicted.

Coding · Q9 · Train a model

Using scikit-learn, train a simple linear regression on X = [[1],[2],[3],[4],[5]] and y = [3,5,7,9,11], then print the slope, intercept, and the prediction for x = 8.

Solution

Python

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [3, 5, 7, 9, 11]              # the true rule is y = 2x + 1

model = LinearRegression()
model.fit(X, y)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predict x=8:", model.predict([[8]]))

Output

Slope: 2.0
Intercept: 1.0
Predict x=8: [17.]

The model recovers the exact rule y = 2x + 1, so x = 8 → 2(8)+1 = 17.

Coding · Q10 · Multiple regression

You have a DataFrame df with feature columns ['Hours','PrevScore'] and target 'Result'. Write code to split into train/test sets, train a multiple linear regression, and print the model's coefficients.

Solution

Python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df[['Hours', 'PrevScore']]   # multiple features
y = df['Result']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)     # one weight per feature
print("Intercept:", model.intercept_)

Example output (actual values depend on the data in df; shown rounded)

Coefficients: [2.865 1.021]
Intercept: -34.30

Each value in coef_ is the weight for one feature, in column order: Hours → 2.865, PrevScore → 1.021.

Short Answer · Q11 · Concept

Name any three assumptions of Linear Regression and briefly explain why each matters.

Model answer

Linearity: the model fits a straight line, so the true relationship must be roughly linear or predictions are systematically wrong.
Homoscedasticity: error variance must be constant; if it grows with x, confidence intervals and significance tests built on the model become unreliable.
No multicollinearity: input features must not be near-duplicates, otherwise the coefficients become unstable and uninterpretable.
(Other valid answers: independence of errors, normality of errors, zero-mean errors, exogeneity.)

🎯 Lecture 3: must-remember

Equation: y = β₀ + β₁x + ε.
OLS minimises Σ(y−ŷ)² (it squares errors so they don't cancel).
Slope formula: m = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)²; intercept: c = ȳ − m·x̄.
Seven assumptions: L.I.N.E. plus no multicollinearity, zero-mean errors, exogeneity.
Limitations: linear-only, outlier-sensitive, multicollinearity, omitted-variable bias.
