⚡ LECTURE 6

Decision Trees

A model that thinks in nested if-else questions. Learn how trees split data, how impurity (Gini & Information Gain) decides the best split, and how trees handle both classification and regression.

Syllabus topics 20–23 · ⏱ ~26 min read · 12 practice questions

In this lecture

Introduction to Decision Trees
Impurity: Entropy & Information Gain
Impurity: Gini Index
Advantages vs Disadvantages
Regression Trees
Overfitting & Hyperparameter Tuning
Practice Questions

6.1 Introduction to Decision Trees

💊 Why we need trees: the drug-dosage story

A drug trial: a low dose is ineffective (0), a medium dose works (1), and a high dose becomes toxic (0). The outcome goes up, then down. Logistic Regression on the raw dosage assumes the probability of Y changes monotonically with X ("more X → more probability of Y", or always less), so it cannot model this non-monotonic data. But a simple nested rule can: "if dosage < 3.5 → 0; else if dosage < 8 → 1; else → 0." That nested rule is a decision tree.
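The nested rule from the drug-dosage story can be written directly as code. This is a minimal sketch; the function name drug_effect is made up for illustration, and the thresholds 3.5 and 8 are the ones quoted in the story.

```python
def drug_effect(dosage):
    """Return 1 if the dose is predicted to work, else 0."""
    if dosage < 3.5:
        return 0      # low dose: ineffective
    elif dosage < 8:
        return 1      # medium dose: works
    else:
        return 0      # high dose: toxic

for d in [1, 5, 10]:
    print(d, "->", drug_effect(d))
```

Each if/elif test plays the role of a decision node, and each return statement plays the role of a leaf.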

Decision Tree — a supervised model that repeatedly splits data using yes/no questions on features, forming a tree of decisions. Each internal node tests a feature, each branch is an outcome, and each leaf gives a final prediction.

Anatomy of a tree

Root node — the first/topmost question, applied to all data.
Internal (decision) nodes — further questions that split data.
Branches — the yes/no outcomes of a question.
Leaf (terminal) nodes — final predictions; no more splitting.

The tree learns by asking: "Which feature, split where, best separates the classes?" The "best" split is the one that makes the resulting groups as pure as possible.

Purity — a node is pure if all its samples belong to one class. Impurity measures how mixed a node is. The tree always picks the split that reduces impurity the most.

6.2 Impurity Measure 1 — Entropy & Information Gain

Entropy — a measure of disorder/uncertainty in a node. 0 = perfectly pure (all one class); 1 = maximum disorder (a 50/50 mix, for binary classes).

Entropy = − Σ p_i · log₂(p_i)
where p_i = proportion of class i in the node.
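The entropy formula can be sketched in a few lines of plain Python. The function name entropy and the choice to pass per-class counts (rather than proportions) are assumptions made for this illustration.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a node, given a list of per-class sample counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c == 0:
            continue          # an empty class contributes nothing (0 * log2(0) is treated as 0)
        p = c / total
        result -= p * math.log2(p)
    return result

print(entropy([4, 4]))   # 50/50 mix of two classes
print(entropy([8, 0]))   # pure node
print(entropy([6, 2]))   # somewhere in between
```

Skipping zero counts matters: math.log2(0) raises an error, so a pure node would otherwise crash the function.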

Information Gain (IG) — how much entropy is reduced by a split. The tree chooses the feature with the highest Information Gain.

IG = Entropy(parent) − Σ (n_child / n_parent) · Entropy(child)
Each child's entropy is weighted by its share of the parent's samples. Higher IG = the split removed more uncertainty = a better split.

🧩 Worked example: entropy of a node

A node has 8 samples: 4 "Yes" and 4 "No", so p(Yes) = 0.5 and p(No) = 0.5.
Entropy = −(0.5·log₂0.5 + 0.5·log₂0.5) = −(0.5·(−1) + 0.5·(−1)) = 1.0, the maximum disorder.
If instead all 8 were "Yes": Entropy = −(1·log₂1) = 0, perfectly pure.
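Information Gain for a candidate split can be sketched as follows. The function names and the three example splits are hypothetical; each child is given as a list of [Yes, No] counts, and the parent is the 4 "Yes" / 4 "No" node from the worked example.

```python
import math

def entropy(counts):
    """Entropy (in bits) from a list of per-class counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(parent, children):
    """parent: class counts before the split; children: one list of counts per branch."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [4, 4]
print(information_gain(parent, [[4, 0], [0, 4]]))   # both children pure
print(information_gain(parent, [[3, 1], [1, 3]]))   # children only partly separated
print(information_gain(parent, [[2, 2], [2, 2]]))   # children as mixed as the parent
```

The first split recovers the full 1.0 bit of entropy, the last recovers nothing, and the middle one falls in between, so a tree using this criterion would prefer the first.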

6.3 Impurity Measure 2 — Gini Index

The Gini Index is the other common impurity measure (it is scikit-learn's default). It is faster to compute than entropy because it avoids logarithms.

Gini = 1 − Σ (p_i)²
Gini = 0 → perfect purity (the goal). Gini = 0.5 → maximum impurity for binary classes (the enemy).

Choosing the best split — weighted average

For a candidate split, compute the Gini of each child branch, then take the weighted average (weighted by how many samples fall in each branch). The split with the lowest weighted Gini wins.
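The weighted average described above can be sketched like this. The function names and the two candidate splits are hypothetical; each branch is a list of [Yes, No] counts, and every branch is assumed to contain at least one sample.

```python
def gini(counts):
    """Gini index from a list of per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """children: one list of class counts per branch of the split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Two hypothetical ways of splitting the same 6 samples
split_a = [[3, 0], [0, 3]]   # both branches pure
split_b = [[2, 1], [1, 2]]   # both branches mixed
print(weighted_gini(split_a))
print(weighted_gini(split_b))
```

The split with the lower value (split_a here) is the one the tree would choose.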

🧩 Worked example: "Will the reel go viral?" (from the worksheet)

Splitting on Duration < 60s produces two pure branches:
Left branch (Duration < 60s) → all "Viral": Gini = 1 − (1² + 0²) = 0.
Right branch (Duration ≥ 60s) → all "Not Viral": Gini = 1 − (0² + 1²) = 0.
Weighted Gini = 0. A perfect split! Compare this with splitting on "Trending Audio", which leaves a messy 2-Yes/1-No branch (Gini ≈ 0.44). The tree picks Duration because its weighted Gini is lower.
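For a numeric feature such as Duration, the tree also has to decide where to split. One common approach, sketched here, is to try the midpoint between each pair of adjacent values and keep the threshold with the lowest weighted Gini. The durations and labels below are invented for illustration and are not the worksheet data; the function names are also hypothetical.

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between adjacent distinct values; return (threshold, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical reel durations in seconds and whether each went viral
durations = [15, 30, 45, 75, 90, 120]
viral = ["Yes", "Yes", "Yes", "No", "No", "No"]
print(best_threshold(durations, viral))
```

With this made-up data the search lands on a threshold of 60.0 with a weighted Gini of 0.0, the same kind of perfect split as in the worked example.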

🔑 Gini vs Information Gain

Both measure node impurity and both pick the split that maximises purity. Gini = 1 − Σp² (no logs, faster, sklearn default). Entropy/Information Gain uses log₂ (slightly more expensive). In practice they give very similar trees. For binary classes, Gini ranges 0→0.5 and Entropy ranges 0→1.
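To see how closely the two measures track each other, the sketch below tabulates both for a binary node as the proportion p of one class varies. The helper names gini_binary and entropy_binary are made up for this illustration.

```python
import math

def gini_binary(p):
    """Gini index of a two-class node where one class has proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy_binary(p):
    """Entropy (in bits) of a two-class node where one class has proportion p."""
    if p in (0, 1):
        return 0.0            # pure node: avoid log2(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p={p:.1f}  gini={gini_binary(p):.3f}  entropy={entropy_binary(p):.3f}")
```

Both curves are zero for a pure node, peak at p = 0.5, and are symmetric around that point, which is why they usually rank candidate splits in a similar order.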

Python · computing Gini by hand

# A node with 3 "Yes" and 1 "No"
p_yes = 3 / 4
p_no  = 1 / 4

gini = 1 - (p_yes**2 + p_no**2)
print("Gini:", gini)

Output
Gini: 0.375

6.4 Advantages vs Disadvantages

Advantages ✅	Disadvantages ❌
Easy to understand & visualise (white-box model)	Prone to overfitting — can memorise training data
Handles both numerical & categorical data	Unstable — small data changes can produce a very different tree
Captures non-linear patterns (unlike Linear/Logistic Regression)	A single deep tree can be inaccurate vs ensembles
No feature scaling required	Can create biased trees if classes are imbalanced
Implicitly does feature selection (important features split first)	Greedy splitting — not guaranteed to find the globally optimal tree

6.5 Regression Trees

Decision trees are not just for classification. A regression tree predicts a continuous number.

Regression Tree — a decision tree whose leaves output a number (typically the average of the target values of the training samples that landed in that leaf), instead of a class label.

Key differences from a classification tree:

Splits are chosen to minimise variance (or MSE) within the child nodes, not Gini/entropy.
The leaf prediction is the mean of the samples in that leaf.
The scikit-learn class is DecisionTreeRegressor instead of DecisionTreeClassifier.
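The first two differences can be sketched without any library. The code below tries every split point on a single feature and keeps the one that minimises the summed squared error of the two children (equivalent to minimising their sample-weighted variance), then uses each child's mean as its prediction. The data and function names are hypothetical, and the sketch assumes at least two distinct feature values.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean (variance times the sample count)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_regression_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the children's total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue   # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, t, mean(left), mean(right))
    return best[1:]

# Hypothetical data: house area (sq ft) -> price (in thousands)
area = [800, 900, 1000, 1500, 1600, 1700]
price = [100, 110, 120, 200, 210, 220]
threshold, left_mean, right_mean = best_regression_split(area, price)
print(threshold, left_mean, right_mean)
```

On this toy data the best split falls at 1250.0, with leaf predictions 110.0 (small houses) and 210.0 (large houses); a one-split tree would therefore predict 210.0 for an area of 1400.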

Python · classification vs regression tree

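from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# X_train / y_train are assumed to be prepared already (not shown here).

# Classification tree: predicts a category, uses Gini by default.
# Here y_train holds class labels.
clf = DecisionTreeClassifier(criterion='gini')   # or 'entropy'
clf.fit(X_train, y_train)

# Regression tree: predicts a number, minimises squared error.
# Here y_train must hold continuous values, and X_train a single
# feature column, because predict() below is given one value per row.
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
print(reg.predict([[1400]]))   # prediction for one sample whose feature value is 1400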

6.6 Overfitting & Hyperparameter Tuning

⚠️ The overfitting problem

An unconstrained tree keeps splitting until every leaf is pure, so it memorises the training data. Result: training accuracy near 100% but much lower test accuracy. That gap is overfitting (high variance; see Lecture 5).

Pruning with hyperparameters

We limit the tree's growth using hyperparameters:

Hyperparameter	Effect
`max_depth`	Maximum number of levels — the main brake on overfitting
`min_samples_split`	Minimum samples a node needs before it is allowed to split
`min_samples_leaf`	Minimum samples required in a leaf node
`criterion`	`'gini'` or `'entropy'` — the impurity measure

Python · building & tuning a tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 1. Unconstrained tree -> overfits
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test  acc:", accuracy_score(y_test,  model.predict(X_test)))

# 2. Constrained tree -> better generalisation
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
model.fit(X_train, y_train)

Output
Train acc: 1.0
Test  acc: 0.84

Notice training accuracy 1.0 vs test 0.84 in the unconstrained tree: classic overfitting. Limiting max_depth typically narrows that gap, usually by giving up some training accuracy in exchange for better generalisation.
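To see the effect of max_depth directly, one option is to train the same tree at several depths and compare train and test accuracy. The sketch below uses synthetic data from make_classification purely for illustration (it is not the lecture's dataset), so the exact numbers will differ from the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (an assumption made only for this demo)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for depth in [None, 2, 3, 5]:            # None = unconstrained tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    # score() returns accuracy for classifiers
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}  test={results[depth][1]:.2f}")
```

The unconstrained tree should fit the training set perfectly; whether a shallower tree scores better on the test set depends on the data, which is why the depth is usually chosen by cross-validation.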

Python · automated tuning with GridSearchCV

param_grid = {
    'max_depth':        [2, 3, 4, 5],
    'min_samples_split':[2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}
# cv=5 -> 5-fold cross-validation on every combination
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)

Output
Best Params: {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 5}

💡 Tip: visualising a tree

Use plot_tree(model, feature_names=[...], class_names=[...], filled=True) from sklearn.tree to draw the tree (this requires matplotlib). Each box shows the split condition, the Gini value, the sample count and the class, which is great for understanding what the model learned.
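When a plot is not convenient, sklearn.tree also provides export_text, which prints the learned rules as indented text. The sketch below uses the built-in Iris dataset as a stand-in; the dataset and the max_depth=2 setting are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(iris.data, iris.target)

# One line per node: "feature <= threshold" for decision nodes, "class: k" for leaves
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules read as the same nested if-else questions the tree applies when it makes a prediction.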

Practice Questions

Impurity calculations and overfitting are frequent exam topics — practice them here.

MCQ · Q1 · Why trees

Decision Trees can model a relationship that goes up and then down (non-monotonic). Logistic Regression cannot, because it:

A Needs more data
B Assumes more of a feature always means more probability of the outcome
C Cannot handle numbers
D Always overfits

Answer: B

Logistic Regression fits a linear decision boundary, so the predicted probability can only rise or only fall as a feature increases (a monotonic relationship). Trees split into ranges, so they handle "low→0, medium→1, high→0" easily.

MCQ · Q2 · Purity

A node where all samples belong to the same class has a Gini index of:

A 0
B 0.5
C 1
D Infinity

Answer: A

Gini = 1 − Σp². If one class has p = 1: Gini = 1 − (1²) = 0. A pure node has zero impurity — exactly the goal of splitting.

MCQ · Q3 · Information Gain

When choosing the root node, a decision tree picks the feature with the:

A Lowest Information Gain
B Highest Information Gain
C Most missing values
D Largest number of categories

Answer: B

Information Gain measures how much entropy (disorder) a split removes. The tree greedily picks the split with the highest IG — the one that produces the purest children.

MCQ · Q4 · Entropy

For a binary classification node, entropy is at its maximum when:

A All samples are one class
B The samples are a 50/50 mix of both classes
C There is only one sample
D The node is a leaf

Answer: B

A 50/50 split is maximum uncertainty → entropy = 1.0 (binary). A pure node (all one class) has entropy 0.

MCQ · Q5 · Overfitting

An unconstrained decision tree gives 100% training accuracy but only 78% test accuracy. The best fix is:

A Remove the test set
B Limit tree growth, e.g. set max_depth
C Train for more epochs
D Switch the criterion to entropy

Answer: B

The gap signals overfitting. Pruning hyperparameters — max_depth, min_samples_leaf, min_samples_split — limit growth and improve generalisation.

MCQ · Q6 · Regression tree

In a regression tree, what does a leaf node typically output?

A A class label
B The average of the target values of the samples in that leaf
C The Gini index
D A probability between 0 and 1

Answer: B

Regression trees predict a continuous number — the leaf outputs the mean target value of the training samples that reached it.

MCQ · Q7 · Advantages

Which is a genuine advantage of decision trees?

A They never overfit
B They require no feature scaling and are easy to interpret
C They can only handle numeric data
D They are always more accurate than any other model

Answer: B

Trees need no scaling, handle mixed data types, and are white-box (easy to visualise). They do overfit easily — that is a disadvantage, not an advantage.

Numerical · Q8 · Gini

A node has 6 "Yes" and 4 "No" samples. Compute its Gini index.

Answer: 0.48

p(Yes) = 6/10 = 0.6, p(No) = 4/10 = 0.4. Gini = 1 − (0.6² + 0.4²) = 1 − (0.36 + 0.16) = 1 − 0.52 = 0.48.

Numerical · Q9 · Gini weighted

A split produces two pure leaves: Left (all Yes) and Right (all No). What is the weighted Gini of this split, and is it a good split?

Answer: 0 — a perfect split

Each leaf is pure → Gini of each = 1 − (1²) = 0. The weighted average of two zeros is 0. A weighted Gini of 0 is the best possible result — the split perfectly separates the classes.

Coding · Q10 · Compute Gini

Write a Python function gini(counts) that takes a list of class counts and returns the Gini index. Test it on [3, 1].

Solution

Python

def gini(counts):
    total = sum(counts)
    impurity = 1
    for c in counts:
        p = c / total
        impurity -= p ** 2
    return impurity

print("Gini [3,1]:", gini([3, 1]))
print("Gini [2,2]:", gini([2, 2]))   # max impurity
print("Gini [4,0]:", gini([4, 0]))   # pure

Output
Gini [3,1]: 0.375
Gini [2,2]: 0.5
Gini [4,0]: 0.0

Coding · Q11 · Train a tree

Train a Decision Tree classifier with max_depth=3, then print its training and test accuracy.

Solution

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc  = accuracy_score(y_test,  model.predict(X_test))
print("Train accuracy:", round(train_acc, 3))
print("Test accuracy :", round(test_acc, 3))

Output
Train accuracy: 0.91
Test accuracy : 0.88

With max_depth=3 the train/test gap is small — the tree generalises well rather than memorising.

Short Answer · Q12 · Concept

State one advantage and one disadvantage of decision trees, and name two hyperparameters used to control overfitting.

Model answer

Advantage: they are easy to interpret/visualise and need no feature scaling. Disadvantage: they overfit easily — an unconstrained tree memorises the training data. Hyperparameters to control overfitting: max_depth (limits levels) and min_samples_leaf / min_samples_split (require enough samples before forming leaves/splits).

🎯 Lecture 6: must-remember

Tree = root → internal nodes → leaves. Splits are chosen to maximise purity.
Entropy = −Σp·log₂p (0→1 for binary classes); Information Gain = entropy reduction (pick the highest).
Gini = 1−Σp² (0→0.5 for binary classes, sklearn default).
Regression tree → leaf outputs the mean.
Trees overfit → control with max_depth, min_samples_leaf, min_samples_split.
