GenAI Exam Prep
Home Mock Exam
โšก LECTURE 2

Data Preprocessing

Models are only as good as the data fed to them. This lecture covers cleaning that data โ€” handling missing values, removing outliers, scaling features and encoding text into numbers.

Syllabus topics 5โ€“7 โฑ ~24 min read 11 practice questions

2.1 Why preprocessing matters

๐Ÿ”‘ Garbage In, Garbage Out Preprocessing is the most critical step in the ML pipeline. A model trained on dirty data produces unreliable predictions, no matter how advanced it is. "A simple model with clean data beats an advanced model with messy data."

Think of an ML model as a Ferrari engine and data as the fuel. Pour in contaminated fuel and even a Ferrari breaks down. The preprocessing pipeline has four stages we care about: load โ†’ inspect โ†’ clean (missing values + outliers) โ†’ transform (scale + encode).

Inspecting data first

Before cleaning anything, you must understand the data. Key pandas tools:

MethodWhat it shows
df.head(n) / df.tail(n)First / last n rows โ€” verify headers and format
df.sample(n)n random rows โ€” check data variety unbiasedly
df.info()Row/column counts, data types, non-null counts, memory usage
df.describe()Descriptive statistics (mean, std, min, max, quartiles) for numeric columns
df.isnull().sum()Exact count of missing cells per column
๐Ÿ’ก Tip โ€” finding missing values fast df.info() shows non-null counts, but df.isnull().sum() is better for pinpointing which column has missing data โ€” it returns an exact missing-count for each column directly.

2.2 Handling Missing Values

Why is data missing? (three mechanisms)

TypeMeaningExample
MCAR
Missing Completely At Random
No pattern at all to the missingnessA lab sample was accidentally dropped
MAR
Missing At Random
Missingness depends on other observed dataTest scores missing for students who were absent
MNAR
Missing Not At Random
Missingness depends on the missing value itselfHigh earners refusing to disclose income

Strategy 1 โ€” Deletion

Python ยท deletion
df_dropped = df.dropna()        # drop every row containing any NaN

Strategy 2 โ€” Imputation (Mean / Median / Mode)

Imputation means filling the gaps instead of deleting them.

StrategyBest forWatch out
Mean imputationNumerical, normally-distributed (symmetric) dataVery sensitive to outliers
Median imputationSkewed numerical data (income, house prices)Robust against outliers โ€” usually safer
Mode imputationCategorical data (City, Gender)Use the most frequent value
Python ยท imputation
import pandas as pd
df = pd.read_csv('loan_data.csv')

# Mean imputation โ€” risky if outliers exist
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median imputation โ€” safe choice for skewed columns like Income
df['Income'] = df['Income'].fillna(df['Income'].median())

# Mode imputation โ€” for categorical columns
df['City'] = df['City'].fillna(df['City'].mode()[0])
๐Ÿงฉ The Elon Musk example โ€” why median beats mean A dataset of student incomes: 3200, 4100, 5200 โ€ฆ and one row of 10,000,000 (Elon Musk). The mean gets dragged sky-high by that one outlier, so filling gaps with the mean is wrong. The median (middle value) ignores the extreme, so it is the safer choice for skewed data.

Strategy 3 โ€” Time-series filling

For time-series data (e.g. stock prices) values depend on neighbouring days:

2.3 Handling Outliers

Outlier โ€” a data point that deviates significantly from other observations. It may be an error (Age = 200) or a valid extreme case (a billionaire in a salary dataset).

Outliers pull the mean toward them and distort analysis. Algorithms like Linear Regression are highly sensitive to them. The boxplot is the standard visual tool โ€” its box holds the middle 50% of data, whiskers extend 1.5ร—IQR, and points beyond are flagged as outliers.

Method 1 โ€” Z-Score

The Z-score tells you how many standard deviations a point is from the mean.

Z = (x โˆ’ ฮผ) / ฯƒ If Z > 3 or Z < โˆ’3, the point is considered an outlier.

Method 2 โ€” IQR (Interquartile Range)

More robust than Z-score because it uses quartiles, not the mean/std (which are themselves affected by outliers). Invented by John Tukey.

IQR = Q3 โˆ’ Q1
Lower Bound = Q1 โˆ’ 1.5 ร— IQR   |   Upper Bound = Q3 + 1.5 ร— IQR Q1 = 25th percentile, Q3 = 75th percentile. Anything outside the bounds is an outlier.
Python ยท IQR outlier removal
# 1. Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)

# 2. Compute the IQR and the fences
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# 3. Keep only rows inside the fences
df_clean = df[(df['Income'] >= lower) & (df['Income'] <= upper)]
print("Rows removed:", len(df) - len(df_clean))
OutputRows removed: 1
๐Ÿงฉ Worked IQR by hand Sorted incomes (8 values): take the 2nd value as Q1 and the 6th as Q3. If Q1 = 3500 and Q3 = 6000, then IQR = 2500. Upper bound = 6000 + 1.5ร—2500 = 9750. Elon Musk's 10,000,000 > 9750 โ†’ flagged as an outlier and removed.

Handling strategies

2.4 Feature Scaling

โš ๏ธ The "size problem" Models become biased toward features with bigger numbers. If Income โ‰ˆ 5000 and JobStability โ‰ˆ 2.5, the model may treat Income as ~2000ร— more important โ€” purely because the numbers are larger, not because it matters more. Feature scaling fixes this by putting all features on a comparable range.

Min-Max Scaling (Normalization)

Squashes every value into the range [0, 1].

x' = (x โˆ’ min) / (max โˆ’ min) If min income = 3000 and max = 7000, then a 5000 income โ†’ (5000โˆ’3000)/(7000โˆ’3000) = 0.5

Standard Scaling (Z-Score Standardization)

Centres data around mean 0 with standard deviation 1. Handles outliers slightly better than Min-Max.

x' = (x โˆ’ ฮผ) / ฯƒ
Python ยท scaling with scikit-learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max: scales every value into [0, 1]
mm = MinMaxScaler()
df['Income'] = mm.fit_transform(df[['Income']])   # note: double brackets = 2D

# Standard: mean 0, std 1
ss = StandardScaler()
df['Salary'] = ss.fit_transform(df[['Salary']])
๐Ÿ’ก Tip โ€” Normalization vs Standardization Normalization (Min-Max) โ†’ fixed range [0,1], good when you know the bounds and data is not heavily skewed. Standardization (Z-score) โ†’ mean 0/std 1, no fixed range, better when there are outliers or for algorithms that assume Gaussian-like data.

2.5 Data Encoding Methods

๐Ÿ”‘ The language barrier ML models are mathematical equations โ€” they can only multiply, add and subtract numbers. They cannot understand text like "Male" or "PhD". Encoding translates categorical text into numbers.

Two kinds of categorical data

Method 1 โ€” Label Encoding (for Ordinal data)

Assigns each category a unique integer (S=0, M=1, L=2).

Python ยท LabelEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Education'] = le.fit_transform(df['Education'])
# B.Tech -> 0,  M.Tech -> 1,  PhD -> 2  (order is preserved)
โš ๏ธ The "Gender Trap" โ€” never label-encode nominal data If you label-encode colours as Red=0, Green=1, Blue=2, the model thinks Blue > Green > Red โ€” a fake ranking that does not exist. The same applies to Gender (Male=0, Female=1 implies Female > Male). For nominal data, use One-Hot Encoding instead.

Method 2 โ€” One-Hot Encoding (for Nominal data)

Creates a separate binary column for each category โ€” 1 if present, 0 if not. No fake ordering is introduced.

Python ยท One-Hot Encoding
import pandas as pd

# 'Color' has values Red / Green / Blue
df = pd.get_dummies(df, columns=['Color'], drop_first=True)
# Creates columns: Color_Green, Color_Blue
# (Red is the dropped baseline)

drop_first=True drops one column to avoid the Dummy Variable Trap (multicollinearity โ€” one column being perfectly predictable from the others).

๐Ÿ’ก Decision rule for the exam Ordinal (has an order) โ†’ Label Encoding.   Nominal (no order) โ†’ One-Hot Encoding. Getting this pairing wrong is the most common preprocessing mistake.
? Practice Questions

Choose your answer, then check it. Coding questions reveal a full solution with output.

MCQQ1Missing values

A numeric column is heavily skewed and contains outliers. The safest imputation strategy is:

  • A Mean imputation
  • B Median imputation
  • C Mode imputation
  • D Drop the entire column
Answer: B

The mean is dragged by outliers; the median (middle value) ignores extremes, making it the robust choice for skewed data. Mode is for categorical columns.

MCQQ2Missingness types

High earners systematically refusing to report their income is an example of:

  • A MCAR โ€” Missing Completely At Random
  • B MAR โ€” Missing At Random
  • C MNAR โ€” Missing Not At Random
  • D None โ€” this is not missing data
Answer: C

The missingness depends on the missing value itself (the income being high causes it to be withheld) โ€” that is MNAR.

MCQQ3Outliers

For a dataset with Q1 = 20 and Q3 = 40, what is the upper bound of the IQR method?

  • A 60
  • B 70
  • C 50
  • D 100
Answer: B

IQR = Q3 โˆ’ Q1 = 40 โˆ’ 20 = 20. Upper bound = Q3 + 1.5ร—IQR = 40 + 1.5ร—20 = 40 + 30 = 70.

MCQQ4Z-Score

Using the Z-score method, a data point is typically flagged as an outlier when:

  • A |Z| > 1
  • B |Z| > 2
  • C |Z| > 3
  • D Z = 0
Answer: C

A point more than 3 standard deviations from the mean (Z > 3 or Z < โˆ’3) is treated as an outlier. Z = 0 means the point is exactly the mean.

MCQQ5Encoding

Which column should be encoded with One-Hot Encoding rather than Label Encoding?

  • A Size: S, M, L, XL
  • B Satisfaction: Low, Medium, High
  • C City: New York, Paris, Tokyo
  • D Education: B.Tech, M.Tech, PhD
Answer: C

City is nominal โ€” no natural order โ€” so one-hot encoding avoids inventing a fake ranking. The other three are ordinal and suit label encoding.

MCQQ6Scaling

Min-Max scaling transforms values into which range?

  • A [0, 1]
  • B [โˆ’1, 1]
  • C mean 0, std 1
  • D [โˆ’3, 3]
Answer: A

Min-Max scaling (normalization) squashes data into [0, 1]. "mean 0, std 1" describes Standard scaling instead.

MCQQ7Dummy trap

In pd.get_dummies(), what does drop_first=True prevent?

  • A Missing values
  • B The Dummy Variable Trap (multicollinearity)
  • C Outliers in the data
  • D Overfitting of the first row
Answer: B

Keeping all one-hot columns makes one column perfectly predictable from the others (multicollinearity). Dropping one removes this redundancy โ€” the Dummy Variable Trap.

NumericalQ8Min-Max

A column has min = 3000 and max = 7000. After Min-Max scaling, what value does 5000 become?

Answer: 0.5

x' = (x โˆ’ min)/(max โˆ’ min) = (5000 โˆ’ 3000)/(7000 โˆ’ 3000) = 2000/4000 = 0.5. The minimum maps to 0, the maximum to 1, and the midpoint to 0.5.

CodingQ9Imputation

Write pandas code that loads data.csv, prints the count of missing values per column, then fills missing values in the Age column with the column's median.

Solution
Python
import pandas as pd

df = pd.read_csv('data.csv')

# Count missing values in every column
print(df.isnull().sum())

# Fill missing Age with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

print("Missing in Age after fill:", df['Age'].isnull().sum())
OutputName 0 Age 3 Income 2 dtype: int64 Missing in Age after fill: 0
CodingQ10IQR outliers

Write code to remove outliers from the Salary column of a DataFrame df using the IQR method.

Solution
Python
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
print("Before:", len(df), " After:", len(df_clean))
OutputBefore: 200 After: 194

The & combines two conditions; each condition must be wrapped in parentheses in pandas.

Short AnswerQ11Concept

Why is feature scaling necessary before training many ML models? Give a concrete example.

Model answer

Models become biased toward features with numerically larger values. For example, with Income โ‰ˆ 5000 and JobStability โ‰ˆ 2.5, the model may treat Income as far more important just because its numbers are bigger โ€” even though both features matter. Scaling (Min-Max or Standard) puts every feature on a comparable range so each contributes fairly.

๐ŸŽฏ Lecture 2 โ€” must-remember Missingness: MCAR / MAR / MNAR. Imputation: mean (symmetric), median (skewed/outliers), mode (categorical). Outliers: Z-score (|Z|>3), IQR (Q1โˆ’1.5ยทIQR, Q3+1.5ยทIQR). Scaling: Min-Max โ†’ [0,1]; Standard โ†’ mean 0, std 1. Encoding: ordinal โ†’ Label, nominal โ†’ One-Hot.