⚡ LECTURE 2

Data Preprocessing

Models are only as good as the data fed to them. This lecture covers cleaning that data — handling missing values, removing outliers, scaling features and encoding text into numbers.

Syllabus topics 5–7 · ⏱ ~24 min read · 11 practice questions

In this lecture

Why preprocessing matters
Handling Missing Values
Handling Outliers (Z-Score & IQR)
Feature Scaling
Data Encoding Methods
Practice Questions

2.1 Why preprocessing matters

🔑 Garbage In, Garbage Out: Preprocessing is the most critical step in the ML pipeline. A model trained on dirty data produces unreliable predictions, no matter how advanced it is. "A simple model with clean data beats an advanced model with messy data."

Think of an ML model as a Ferrari engine and data as the fuel. Pour in contaminated fuel and even a Ferrari breaks down. The preprocessing pipeline has four stages we care about: load → inspect → clean (missing values + outliers) → transform (scale + encode).

Inspecting data first

Before cleaning anything, you must understand the data. Key pandas tools:

Method	What it shows
`df.head(n)` / `df.tail(n)`	First / last n rows — verify headers and format
`df.sample(n)`	n random rows, a less biased view of the data than the first or last rows
`df.info()`	Row/column counts, data types, non-null counts, memory usage
`df.describe()`	Descriptive statistics (mean, std, min, max, quartiles) for numeric columns
`df.isnull().sum()`	Exact count of missing cells per column

💡 Tip: finding missing values fast. df.info() shows non-null counts, but df.isnull().sum() is better for pinpointing which columns have missing data, because it returns the exact missing count for each column directly.
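A minimal sketch of the inspection calls above, run on a small made-up DataFrame. The column names and values are hypothetical and only serve to show what each call reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps (NaN)
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli"],
    "Age": [22, np.nan, 25, 31, np.nan],
    "Income": [3200, 4100, np.nan, 5200, 4800],
})

print(df.head(3))       # first 3 rows
df.info()               # dtypes and non-null counts (prints directly)
print(df.describe())    # summary statistics for numeric columns

missing = df.isnull().sum()   # missing cells per column
print(missing)
```

Here `missing` should report 0 for Name, 2 for Age and 1 for Income, which is quicker to read than subtracting non-null counts from the row total.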

2.2 Handling Missing Values

Why is data missing? (three mechanisms)

Type	Meaning	Example
MCAR (Missing Completely At Random)	No pattern at all to the missingness	A lab sample was accidentally dropped
MAR (Missing At Random)	Missingness depends on other observed data	Test scores missing for students who were absent
MNAR (Missing Not At Random)	Missingness depends on the missing value itself	High earners refusing to disclose income

Strategy 1 — Deletion

Row deletion — drop rows with missing values. Use only when missing data is small (< 5%) and random.
Column deletion — drop a whole column when it has > 60% missing data.

Python · deletion

df_dropped = df.dropna()        # drop every row containing any NaN
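Column deletion can be sketched in the same way. This is an illustration on hypothetical data: the 60% cut-off is the rule of thumb stated above, and the column and variable names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Notes' is 80% missing, 'Age' is 20% missing
df = pd.DataFrame({
    "Age": [22, 25, np.nan, 31, 40],
    "Income": [3200, 4100, 5200, 4800, 3900],
    "Notes": [np.nan, np.nan, np.nan, np.nan, "ok"],
})

missing_share = df.isnull().mean()                      # fraction of NaN per column
cols_to_drop = missing_share[missing_share > 0.60].index
df_cols = df.drop(columns=cols_to_drop)                 # column deletion
df_rows = df_cols.dropna()                              # row deletion on what is left

print(list(df_cols.columns))
print(len(df_rows))
```

Dropping the sparse column first means the later `dropna()` removes only the one row with a missing Age, instead of nearly every row.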

Strategy 2 — Imputation (Mean / Median / Mode)

Imputation means filling the gaps instead of deleting them.

Strategy	Best for	Watch out
Mean imputation	Numerical, normally-distributed (symmetric) data	Very sensitive to outliers
Median imputation	Skewed numerical data (income, house prices)	Robust against outliers — usually safer
Mode imputation	Categorical data (City, Gender)	Use the most frequent value

Python · imputation

import pandas as pd
df = pd.read_csv('loan_data.csv')

# Mean imputation — risky if outliers exist
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median imputation — safe choice for skewed columns like Income
df['Income'] = df['Income'].fillna(df['Income'].median())

# Mode imputation — for categorical columns
df['City'] = df['City'].fillna(df['City'].mode()[0])

🧩 The Elon Musk example: why median beats mean. A dataset of student incomes: 3200, 4100, 5200 … and one row of 10,000,000 (Elon Musk). That single outlier drags the mean far above any typical income, so filling gaps with the mean would insert unrealistic values. The median (middle value) is barely affected by the extreme, so it is the safer choice for skewed data.
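The effect is easy to check numerically. The sketch below uses only the three incomes quoted in the example plus the extreme one; the elided values are left out, so the exact figures are illustrative.

```python
import pandas as pd

# Three typical incomes from the example plus the single extreme value
incomes = pd.Series([3200, 4100, 5200, 10_000_000])

mean_income = incomes.mean()       # pulled up by the single extreme value
median_income = incomes.median()   # midpoint of the two middle values

print(mean_income)     # 2503125.0
print(median_income)   # 4650.0
```

A gap filled with the mean would claim an income of about 2.5 million, while the median stays in the range of the ordinary rows.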

Strategy 3 — Time-series filling

For time-series data (e.g. stock prices) values depend on neighbouring days:

Forward Fill (ffill) — propagate the last valid observation forward.
Backward Fill (bfill) — use the next valid observation to fill the gap.
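Both fills are one-liners in pandas. The sketch below assumes a hypothetical series of daily closing prices with two gaps; the dates and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices with two missing days
prices = pd.Series(
    [100.0, np.nan, 102.0, np.nan, 105.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = prices.ffill()    # each gap takes the previous valid value
backward = prices.bfill()   # each gap takes the next valid value

print(forward.tolist())     # [100.0, 100.0, 102.0, 102.0, 105.0]
print(backward.tolist())    # [100.0, 102.0, 102.0, 105.0, 105.0]
```

Note that a forward fill cannot fill a gap in the very first row, and a backward fill cannot fill one in the very last row, because there is no neighbour to copy from.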

2.3 Handling Outliers

Outlier — a data point that deviates significantly from other observations. It may be an error (Age = 200) or a valid extreme case (a billionaire in a salary dataset).

Outliers pull the mean toward them and distort analysis. Algorithms like Linear Regression are highly sensitive to them. The boxplot is the standard visual tool: its box holds the middle 50% of the data, each whisker extends to the most extreme point within 1.5×IQR of the box edge, and points beyond the whiskers are flagged as outliers.

Method 1 — Z-Score

The Z-score tells you how many standard deviations a point is from the mean.

Z = (x − μ) / σ

If Z > 3 or Z < −3, the point is considered an outlier.
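A minimal sketch of the Z-score rule in pandas, assuming a hypothetical Age column with 19 ordinary values and one data-entry error. Note that pandas' `std()` is the sample standard deviation.

```python
import pandas as pd

# Hypothetical ages: 41..59 plus one data-entry error (200)
ages = pd.Series(list(range(41, 60)) + [200])

z = (ages - ages.mean()) / ages.std()   # Z = (x − μ) / σ, sample std
outliers = ages[z.abs() > 3]            # flag |Z| > 3

print(outliers.tolist())   # [200]
```

With very few rows no point can reach |Z| > 3, because the extreme value inflates σ itself, so this rule works best on reasonably large samples.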

Method 2 — IQR (Interquartile Range)

More robust than Z-score because it uses quartiles, not the mean/std (which are themselves affected by outliers). The 1.5 × IQR rule is attributed to John Tukey.

IQR = Q3 − Q1
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR

Q1 = 25th percentile, Q3 = 75th percentile. Anything outside the bounds is an outlier.

Python · IQR outlier removal

# 1. Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)

# 2. Compute the IQR and the fences
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# 3. Keep only rows inside the fences
df_clean = df[(df['Income'] >= lower) & (df['Income'] <= upper)]
print("Rows removed:", len(df) - len(df_clean))

Output

Rows removed: 1

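🧩 Worked IQR by hand. Sorted incomes (8 values): as a quick approximation, take the 2nd value as Q1 and the 6th as Q3 (pandas interpolates between neighbouring values, so quantile() can give slightly different numbers). If Q1 = 3500 and Q3 = 6000, then IQR = 2500. Lower bound = 3500 − 1.5×2500 = −250; upper bound = 6000 + 1.5×2500 = 9750. Elon Musk's 10,000,000 > 9750, so it is flagged as an outlier and removed.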

Handling strategies

Trimming — completely remove the outlier rows. Pro: simple. Con: data loss.
Capping (Winsorizing) — replace outliers with the upper/lower limit. Pro: preserves data size. Con: modifies the distribution.
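The two strategies can be compared side by side. The sketch below uses hypothetical incomes and caps at the IQR fences via `Series.clip`; other capping limits (for example fixed percentiles) are equally possible.

```python
import pandas as pd

# Hypothetical incomes with one extreme value
income = pd.Series([3200, 3500, 4100, 4800, 5200, 6000, 6500, 10_000_000])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

trimmed = income[(income >= lower) & (income <= upper)]   # trimming: drop the row
capped = income.clip(lower=lower, upper=upper)            # capping: pull the value to the fence

print(len(trimmed), len(capped))   # 7 8
print(capped.max() == upper)       # True
```

Trimming shrinks the dataset by one row, while capping keeps all eight rows and replaces the extreme value with the upper fence.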

2.4 Feature Scaling

⚠️ The "size problem": Many models, especially distance-based and gradient-based ones, become biased toward features with bigger numbers. If Income ≈ 5000 and JobStability ≈ 2.5, the model may treat Income as ~2000× more important purely because the numbers are larger, not because it matters more. Feature scaling fixes this by putting all features on a comparable range.

Min-Max Scaling (Normalization)

Squashes every value into the range [0, 1].

x' = (x − min) / (max − min)

If min income = 3000 and max = 7000, then a 5000 income → (5000−3000)/(7000−3000) = 0.5

Standard Scaling (Z-Score Standardization)

Centres data around mean 0 with standard deviation 1. Handles outliers slightly better than Min-Max.

x' = (x − μ) / σ
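Both formulas can be applied by hand with NumPy before reaching for a library. This sketch assumes a hypothetical income array whose min and max match the Min-Max example above; it uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Hypothetical incomes with min = 3000 and max = 7000
income = np.array([3000.0, 4000.0, 5000.0, 6000.0, 7000.0])

# Min-Max: x' = (x − min) / (max − min)
min_max = (income - income.min()) / (income.max() - income.min())

# Standard: x' = (x − μ) / σ, with the population std (ddof=0)
standard = (income - income.mean()) / income.std()

print(min_max)                          # 5000 maps to 0.5
print(standard.mean(), standard.std())  # approximately 0 and 1
```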

Python · scaling with scikit-learn

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max: scales every value into [0, 1]
mm = MinMaxScaler()
df['Income'] = mm.fit_transform(df[['Income']])   # note: double brackets = 2D

# Standard: mean 0, std 1
ss = StandardScaler()
df['Salary'] = ss.fit_transform(df[['Salary']])

💡 Tip: Normalization vs Standardization. Normalization (Min-Max) → fixed range [0,1], good when you know the bounds and the data is not heavily skewed. Standardization (Z-score) → mean 0/std 1, no fixed range, better when there are outliers or for algorithms that assume Gaussian-like data.

2.5 Data Encoding Methods

🔑 The language barrier: ML models are mathematical equations that can only multiply, add and subtract numbers. They cannot understand text like "Male" or "PhD". Encoding translates categorical text into numbers.

Two kinds of categorical data

Ordinal data — categories with a natural order: Low < Medium < High; S < M < L < XL; B.Tech < M.Tech < PhD.
Nominal data — categories with no order: Red, Green, Blue; New York, Paris, Tokyo; Dog, Cat, Bird.

Method 1 — Label Encoding (for Ordinal data)

Assigns each category a unique integer that follows its rank (S=0, M=1, L=2).

Python · LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Education'] = le.fit_transform(df['Education'])
# B.Tech -> 0,  M.Tech -> 1,  PhD -> 2
# (LabelEncoder numbers categories alphabetically; here that happens to match the real order)
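LabelEncoder assigns integers in sorted (alphabetical) order, so it matches the true ranking only when the two orders coincide. For a column such as Size they do not, and an explicit mapping is a safer sketch. The column name and mapping below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ordinal column
df = pd.DataFrame({"Size": ["M", "S", "XL", "L", "S"]})

# Alphabetical order, which is what LabelEncoder would use
alphabetical = sorted(df["Size"].unique())
print(alphabetical)   # ['L', 'M', 'S', 'XL']

# Explicit mapping that respects S < M < L < XL
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
df["Size_encoded"] = df["Size"].map(size_order)

print(df["Size_encoded"].tolist())   # [1, 0, 3, 2, 0]
```

scikit-learn's OrdinalEncoder accepts a `categories` argument that serves the same purpose when the encoding has to live inside a pipeline.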

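⚠️ The "Gender Trap": never label-encode nominal data. If you label-encode colours as Red=0, Green=1, Blue=2, the model thinks Blue > Green > Red, a fake ranking that does not exist. The same logic applies to Gender (Male=0, Female=1 reads as Female > Male), although with only two categories a 0/1 column is equivalent to one-hot encoding with one column dropped, so the real damage appears with three or more categories. For nominal data, use One-Hot Encoding instead.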

Method 2 — One-Hot Encoding (for Nominal data)

Creates a separate binary column for each category — 1 if present, 0 if not. No fake ordering is introduced.

Python · One-Hot Encoding

import pandas as pd

# 'Color' has values Red / Green / Blue
df = pd.get_dummies(df, columns=['Color'], drop_first=True)
# Creates columns: Color_Green, Color_Red
# (Blue, the first category alphabetically, is the dropped baseline)

drop_first=True drops one column to avoid the Dummy Variable Trap (multicollinearity — one column being perfectly predictable from the others).
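The redundancy is easy to see on a toy column. This sketch uses a hypothetical four-row Color column and `dtype=int` so the dummies print as 0/1.

```python
import pandas as pd

# Hypothetical nominal column
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

full = pd.get_dummies(df, columns=["Color"], dtype=int)
reduced = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=int)

print(list(full.columns))      # ['Color_Blue', 'Color_Green', 'Color_Red']
print(list(reduced.columns))   # ['Color_Green', 'Color_Red']

# The dropped column is recoverable: it is 1 exactly when the others are all 0
blue = 1 - reduced.sum(axis=1)
print(blue.tolist())           # [0, 0, 1, 0]
```

Because `blue` can be rebuilt from the remaining columns, keeping it adds no information, which is exactly the redundancy that drop_first=True removes.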

💡 Decision rule for the exam: Ordinal (has an order) → Label Encoding. Nominal (no order) → One-Hot Encoding. Getting this pairing wrong is a very common preprocessing mistake.

Practice Questions

Choose your answer, then check it. Coding questions reveal a full solution with output.

MCQ · Q1 · Missing values

A numeric column is heavily skewed and contains outliers. The safest imputation strategy is:

A Mean imputation
B Median imputation
C Mode imputation
D Drop the entire column

Answer: B

The mean is dragged by outliers; the median (middle value) ignores extremes, making it the robust choice for skewed data. Mode is for categorical columns.

MCQ · Q2 · Missingness types

High earners systematically refusing to report their income is an example of:

A MCAR — Missing Completely At Random
B MAR — Missing At Random
C MNAR — Missing Not At Random
D None — this is not missing data

Answer: C

The missingness depends on the missing value itself (the income being high causes it to be withheld) — that is MNAR.

MCQ · Q3 · Outliers

For a dataset with Q1 = 20 and Q3 = 40, what is the upper bound of the IQR method?

A 60
B 70
C 50
D 100

Answer: B

IQR = Q3 − Q1 = 40 − 20 = 20. Upper bound = Q3 + 1.5×IQR = 40 + 1.5×20 = 40 + 30 = 70.

MCQ · Q4 · Z-Score

Using the Z-score method, a data point is typically flagged as an outlier when:

A |Z| > 1
B |Z| > 2
C |Z| > 3
D Z = 0

Answer: C

A point more than 3 standard deviations from the mean (Z > 3 or Z < −3) is treated as an outlier. Z = 0 means the point is exactly the mean.

MCQ · Q5 · Encoding

Which column should be encoded with One-Hot Encoding rather than Label Encoding?

A Size: S, M, L, XL
B Satisfaction: Low, Medium, High
C City: New York, Paris, Tokyo
D Education: B.Tech, M.Tech, PhD

Answer: C

City is nominal — no natural order — so one-hot encoding avoids inventing a fake ranking. The other three are ordinal and suit label encoding.

MCQ · Q6 · Scaling

Min-Max scaling transforms values into which range?

A [0, 1]
B [−1, 1]
C mean 0, std 1
D [−3, 3]

Answer: A

Min-Max scaling (normalization) squashes data into [0, 1]. "mean 0, std 1" describes Standard scaling instead.

MCQ · Q7 · Dummy trap

In pd.get_dummies(), what does drop_first=True prevent?

A Missing values
B The Dummy Variable Trap (multicollinearity)
C Outliers in the data
D Overfitting of the first row

Answer: B

Keeping all one-hot columns makes one column perfectly predictable from the others (multicollinearity). Dropping one removes this redundancy — the Dummy Variable Trap.

Numerical · Q8 · Min-Max

A column has min = 3000 and max = 7000. After Min-Max scaling, what value does 5000 become?

Answer: 0.5

x' = (x − min)/(max − min) = (5000 − 3000)/(7000 − 3000) = 2000/4000 = 0.5. The minimum maps to 0, the maximum to 1, and the midpoint to 0.5.

Coding · Q9 · Imputation

Write pandas code that loads data.csv, prints the count of missing values per column, then fills missing values in the Age column with the column's median.

Solution

Python

import pandas as pd

df = pd.read_csv('data.csv')

# Count missing values in every column
print(df.isnull().sum())

# Fill missing Age with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

print("Missing in Age after fill:", df['Age'].isnull().sum())

Output

Name      0
Age       3
Income    2
dtype: int64
Missing in Age after fill: 0

Coding · Q10 · IQR outliers

Write code to remove outliers from the Salary column of a DataFrame df using the IQR method.

Solution

Python

Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
print("Before:", len(df), " After:", len(df_clean))

Output

Before: 200  After: 194

The & combines two conditions; each condition must be wrapped in parentheses in pandas.

Short Answer · Q11 · Concept

Why is feature scaling necessary before training many ML models? Give a concrete example.

Model answer

Models become biased toward features with numerically larger values. For example, with Income ≈ 5000 and JobStability ≈ 2.5, the model may treat Income as far more important just because its numbers are bigger — even though both features matter. Scaling (Min-Max or Standard) puts every feature on a comparable range so each contributes fairly.

🎯 Lecture 2: must-remember
Missingness: MCAR / MAR / MNAR.
Imputation: mean (symmetric), median (skewed/outliers), mode (categorical).
Outliers: Z-score (|Z|>3), IQR (Q1−1.5·IQR, Q3+1.5·IQR).
Scaling: Min-Max → [0,1]; Standard → mean 0, std 1.
Encoding: ordinal → Label, nominal → One-Hot.
