Data Preprocessing
Models are only as good as the data fed to them. This lecture covers cleaning that data โ handling missing values, removing outliers, scaling features and encoding text into numbers.
In this lecture
2.1 Why preprocessing matters
Think of an ML model as a Ferrari engine and data as the fuel. Pour in contaminated fuel and even a Ferrari breaks down. The preprocessing pipeline has four stages we care about: load โ inspect โ clean (missing values + outliers) โ transform (scale + encode).
Inspecting data first
Before cleaning anything, you must understand the data. Key pandas tools:
| Method | What it shows |
|---|---|
df.head(n) / df.tail(n) | First / last n rows โ verify headers and format |
df.sample(n) | n random rows โ check data variety unbiasedly |
df.info() | Row/column counts, data types, non-null counts, memory usage |
df.describe() | Descriptive statistics (mean, std, min, max, quartiles) for numeric columns |
df.isnull().sum() | Exact count of missing cells per column |
df.info() shows non-null counts, but df.isnull().sum() is better for pinpointing which column has missing data โ it returns an exact missing-count for each column directly.
2.2 Handling Missing Values
Why is data missing? (three mechanisms)
| Type | Meaning | Example |
|---|---|---|
| MCAR Missing Completely At Random | No pattern at all to the missingness | A lab sample was accidentally dropped |
| MAR Missing At Random | Missingness depends on other observed data | Test scores missing for students who were absent |
| MNAR Missing Not At Random | Missingness depends on the missing value itself | High earners refusing to disclose income |
Strategy 1 โ Deletion
- Row deletion โ drop rows with missing values. Use only when missing data is small (< 5%) and random.
- Column deletion โ drop a whole column when it has > 60% missing data.
df_dropped = df.dropna() # drop every row containing any NaN
Strategy 2 โ Imputation (Mean / Median / Mode)
Imputation means filling the gaps instead of deleting them.
| Strategy | Best for | Watch out |
|---|---|---|
| Mean imputation | Numerical, normally-distributed (symmetric) data | Very sensitive to outliers |
| Median imputation | Skewed numerical data (income, house prices) | Robust against outliers โ usually safer |
| Mode imputation | Categorical data (City, Gender) | Use the most frequent value |
import pandas as pd
df = pd.read_csv('loan_data.csv')
# Mean imputation โ risky if outliers exist
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Median imputation โ safe choice for skewed columns like Income
df['Income'] = df['Income'].fillna(df['Income'].median())
# Mode imputation โ for categorical columns
df['City'] = df['City'].fillna(df['City'].mode()[0])
Strategy 3 โ Time-series filling
For time-series data (e.g. stock prices) values depend on neighbouring days:
- Forward Fill (
ffill) โ propagate the last valid observation forward. - Backward Fill (
bfill) โ use the next valid observation to fill the gap.
2.3 Handling Outliers
Outliers pull the mean toward them and distort analysis. Algorithms like Linear Regression are highly sensitive to them. The boxplot is the standard visual tool โ its box holds the middle 50% of data, whiskers extend 1.5รIQR, and points beyond are flagged as outliers.
Method 1 โ Z-Score
The Z-score tells you how many standard deviations a point is from the mean.
Method 2 โ IQR (Interquartile Range)
More robust than Z-score because it uses quartiles, not the mean/std (which are themselves affected by outliers). Invented by John Tukey.
Lower Bound = Q1 โ 1.5 ร IQR | Upper Bound = Q3 + 1.5 ร IQR Q1 = 25th percentile, Q3 = 75th percentile. Anything outside the bounds is an outlier.
# 1. Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
# 2. Compute the IQR and the fences
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# 3. Keep only rows inside the fences
df_clean = df[(df['Income'] >= lower) & (df['Income'] <= upper)]
print("Rows removed:", len(df) - len(df_clean))
Handling strategies
- Trimming โ completely remove the outlier rows. Pro: simple. Con: data loss.
- Capping (Winsorizing) โ replace outliers with the upper/lower limit. Pro: preserves data size. Con: modifies the distribution.
2.4 Feature Scaling
Income โ 5000 and JobStability โ 2.5, the model may treat Income as ~2000ร more important โ purely because the numbers are larger, not because it matters more. Feature scaling fixes this by putting all features on a comparable range.
Min-Max Scaling (Normalization)
Squashes every value into the range [0, 1].
Standard Scaling (Z-Score Standardization)
Centres data around mean 0 with standard deviation 1. Handles outliers slightly better than Min-Max.
from sklearn.preprocessing import MinMaxScaler, StandardScaler # Min-Max: scales every value into [0, 1] mm = MinMaxScaler() df['Income'] = mm.fit_transform(df[['Income']]) # note: double brackets = 2D # Standard: mean 0, std 1 ss = StandardScaler() df['Salary'] = ss.fit_transform(df[['Salary']])
2.5 Data Encoding Methods
Two kinds of categorical data
- Ordinal data โ categories with a natural order: Low < Medium < High; S < M < L < XL; B.Tech < M.Tech < PhD.
- Nominal data โ categories with no order: Red, Green, Blue; New York, Paris, Tokyo; Dog, Cat, Bird.
Method 1 โ Label Encoding (for Ordinal data)
Assigns each category a unique integer (S=0, M=1, L=2).
from sklearn.preprocessing import LabelEncoder le = LabelEncoder() df['Education'] = le.fit_transform(df['Education']) # B.Tech -> 0, M.Tech -> 1, PhD -> 2 (order is preserved)
Method 2 โ One-Hot Encoding (for Nominal data)
Creates a separate binary column for each category โ 1 if present, 0 if not. No fake ordering is introduced.
import pandas as pd # 'Color' has values Red / Green / Blue df = pd.get_dummies(df, columns=['Color'], drop_first=True) # Creates columns: Color_Green, Color_Blue # (Red is the dropped baseline)
drop_first=True drops one column to avoid the Dummy Variable Trap (multicollinearity โ one column being perfectly predictable from the others).
Choose your answer, then check it. Coding questions reveal a full solution with output.
A numeric column is heavily skewed and contains outliers. The safest imputation strategy is:
The mean is dragged by outliers; the median (middle value) ignores extremes, making it the robust choice for skewed data. Mode is for categorical columns.
High earners systematically refusing to report their income is an example of:
The missingness depends on the missing value itself (the income being high causes it to be withheld) โ that is MNAR.
For a dataset with Q1 = 20 and Q3 = 40, what is the upper bound of the IQR method?
IQR = Q3 โ Q1 = 40 โ 20 = 20. Upper bound = Q3 + 1.5รIQR = 40 + 1.5ร20 = 40 + 30 = 70.
Using the Z-score method, a data point is typically flagged as an outlier when:
A point more than 3 standard deviations from the mean (Z > 3 or Z < โ3) is treated as an outlier. Z = 0 means the point is exactly the mean.
Which column should be encoded with One-Hot Encoding rather than Label Encoding?
City is nominal โ no natural order โ so one-hot encoding avoids inventing a fake ranking. The other three are ordinal and suit label encoding.
Min-Max scaling transforms values into which range?
Min-Max scaling (normalization) squashes data into [0, 1]. "mean 0, std 1" describes Standard scaling instead.
In pd.get_dummies(), what does drop_first=True prevent?
Keeping all one-hot columns makes one column perfectly predictable from the others (multicollinearity). Dropping one removes this redundancy โ the Dummy Variable Trap.
A column has min = 3000 and max = 7000. After Min-Max scaling, what value does 5000 become?
x' = (x โ min)/(max โ min) = (5000 โ 3000)/(7000 โ 3000) = 2000/4000 = 0.5. The minimum maps to 0, the maximum to 1, and the midpoint to 0.5.
Write pandas code that loads data.csv, prints the count of missing values per column, then fills missing values in the Age column with the column's median.
import pandas as pd
df = pd.read_csv('data.csv')
# Count missing values in every column
print(df.isnull().sum())
# Fill missing Age with the median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())
print("Missing in Age after fill:", df['Age'].isnull().sum())
Write code to remove outliers from the Salary column of a DataFrame df using the IQR method.
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_clean = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
print("Before:", len(df), " After:", len(df_clean))
The & combines two conditions; each condition must be wrapped in parentheses in pandas.
Why is feature scaling necessary before training many ML models? Give a concrete example.
Models become biased toward features with numerically larger values. For example, with Income โ 5000 and JobStability โ 2.5, the model may treat Income as far more important just because its numbers are bigger โ even though both features matter. Scaling (Min-Max or Standard) puts every feature on a comparable range so each contributes fairly.