Decision Trees
A model that thinks in nested if-else questions. Learn how trees split data, how impurity (Gini & Information Gain) decides the best split, and how trees handle both classification and regression.
6.1 Introduction to Decision Trees
Anatomy of a tree
- Root node: the first/topmost question, applied to all data.
- Internal (decision) nodes: further questions that split data.
- Branches: the yes/no outcomes of a question.
- Leaf (terminal) nodes: final predictions; no more splitting.
The tree learns by asking: "Which feature, split where, best separates the classes?" The "best" split is the one that makes the resulting groups as pure as possible.
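To make the anatomy concrete, here is a minimal hand-written sketch of a tree as nested if-else questions. The function name, the two features and the 60-second threshold are illustrative assumptions; a real tree learns its questions from data rather than having them typed in.

```python
def predict_viral(duration_seconds: float, trending_audio: bool) -> str:
    """Toy, hand-written tree; features and thresholds are hypothetical."""
    # Root node: the first question, applied to every sample.
    if duration_seconds < 60:
        # Internal node: a further question on the "yes" branch.
        if trending_audio:
            return "Viral"      # leaf node
        return "Not Viral"      # leaf node
    # The "no" branch of the root leads straight to a leaf.
    return "Not Viral"


print(predict_viral(45, True))
print(predict_viral(45, False))
print(predict_viral(120, True))
```

Each `if` is a node, each outcome of the test is a branch, and each `return` is a leaf.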
6.2 Impurity Measure 1: Entropy & Information Gain
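Entropy measures the disorder in a node: Entropy = −Σ p·log₂p, summed over the classes. Take a node of 8 samples split evenly, 4 "Yes" and 4 "No":
Entropy = −(0.5·log₂0.5 + 0.5·log₂0.5) = −(0.5·(−1) + 0.5·(−1)) = 1.0 → maximum disorder.
If instead all 8 were "Yes": Entropy = −(1·log₂1) = 0 → perfectly pure.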
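The two entropy values above can be checked with a short sketch. The helper names `entropy` and `information_gain` are chosen here for illustration, and information gain is taken as the parent's entropy minus the sample-weighted entropy of the children.

```python
from math import log2


def entropy(counts):
    """Base-2 entropy of a node, given its class counts."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # a class with no samples contributes nothing
            p = c / total
            result -= p * log2(p)
    return result


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the sample-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted


print("50/50 node:", entropy([4, 4]))
print("pure node :", entropy([8, 0]))
print("IG of a perfect split:", information_gain([4, 4], [[4, 0], [0, 4]]))
```

A split that turns a 50/50 parent into two pure children removes all of the parent's entropy, so its information gain equals the parent's entropy.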
6.3 Impurity Measure 2: Gini Index
The Gini Index is the other common impurity measure (it is scikit-learn's default): Gini = 1 − Σp², where p is the proportion of each class in the node. It is slightly cheaper to compute than entropy because it avoids logarithms.
Choosing the best split: weighted average
For a candidate split, compute the Gini of each child branch, then take the weighted average (weighted by how many samples fall in each branch). The split with the lowest weighted Gini wins.
Left branch (Duration < 60s) → all "Viral": Gini = 1 − (1² + 0²) = 0.
Right branch (Duration ≥ 60s) → all "Not Viral": Gini = 1 − (0² + 1²) = 0.
Weighted Gini = 0. A perfect split! Compare this with splitting on "Trending Audio", which leaves a messy 2-Yes/1-No branch (Gini ≈ 0.44). The tree picks Duration because its weighted Gini is lower.
# A node with 3 "Yes" and 1 "No"
p_yes = 3 / 4
p_no = 1 / 4
gini = 1 - (p_yes**2 + p_no**2)
print("Gini:", gini)
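The weighted-average step can be sketched the same way. The per-branch counts below are assumptions made for illustration (the notes only give the 2-Yes/1-No branch of the "Trending Audio" split), and `weighted_gini` is a helper name chosen here.

```python
def gini(counts):
    """Gini index of one node, given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children_counts):
    """Average the children's Gini, weighted by samples per branch."""
    total = sum(sum(child) for child in children_counts)
    return sum(sum(child) / total * gini(child) for child in children_counts)


# Hypothetical counts, written as [viral, not_viral] for each branch.
duration_split = [[3, 0], [0, 3]]   # two pure branches
audio_split = [[2, 1], [1, 2]]      # two mixed branches (assumed)

print("Duration      :", weighted_gini(duration_split))
print("Trending Audio:", round(weighted_gini(audio_split), 2))
```

The candidate with the lower weighted Gini is the one the tree would choose.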
6.4 Advantages vs Disadvantages
| Advantages | Disadvantages |
|---|---|
| Easy to understand & visualise (white-box model) | Prone to overfitting: can memorise training data |
| Handles both numerical & categorical data | Unstable: small data changes can produce a very different tree |
| Captures non-linear patterns (unlike Linear/Logistic Regression) | A single deep tree can be inaccurate vs ensembles |
| No feature scaling required | Can create biased trees if classes are imbalanced |
| Implicitly does feature selection (important features split first) | Greedy splitting: not guaranteed to find the globally optimal tree |
6.5 Regression Trees
Decision trees are not just for classification. A regression tree predicts a continuous number.
Key differences from a classification tree:
- Splits are chosen to minimise variance (or MSE) within the child nodes, not Gini/entropy.
- The leaf prediction is the mean of the samples in that leaf.
- The scikit-learn class is DecisionTreeRegressor instead of DecisionTreeClassifier.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predicts a category, uses Gini by default
clf = DecisionTreeClassifier(criterion='gini')  # or 'entropy'
clf.fit(X_train, y_train)

# Regression tree: predicts a number, minimises squared error
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
print(reg.predict([[1400]]))
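To show what "minimise variance within the child nodes" means in practice, the sketch below searches one feature for the threshold with the lowest sample-weighted child variance, then predicts the mean of each leaf. The data and the helper names (`variance`, `best_split`) are made up for illustration; this is a simplified picture, not scikit-learn's implementation.

```python
def variance(values):
    """Population variance of a list of target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_variance) of the best single split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up data: house size in square feet -> price in thousands.
sizes = [800, 900, 1000, 1500, 1600, 1700]
prices = [100, 110, 120, 200, 210, 220]

threshold, score = best_split(sizes, prices)
left_leaf = [p for s, p in zip(sizes, prices) if s <= threshold]
right_leaf = [p for s, p in zip(sizes, prices) if s > threshold]

print("threshold:", threshold)
print("left leaf predicts :", sum(left_leaf) / len(left_leaf))
print("right leaf predicts:", sum(right_leaf) / len(right_leaf))
```

The chosen threshold falls in the gap between the cheap and the expensive houses, because that is where the targets inside each child are most alike.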
6.6 Overfitting & Hyperparameter Tuning
Pruning with hyperparameters
We limit the tree's growth using hyperparameters:
| Hyperparameter | Effect |
|---|---|
| max_depth | Maximum number of levels; the main brake on overfitting |
| min_samples_split | Minimum samples a node needs before it is allowed to split |
| min_samples_leaf | Minimum samples required in a leaf node |
| criterion | 'gini' or 'entropy': the impurity measure |
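As a rough mental model of how the three size limits in the table interact, the sketch below checks whether a candidate split is allowed. `may_split` is a hypothetical helper written for these notes, with the root assumed to sit at depth 0; it is not part of scikit-learn.

```python
def may_split(depth, n_samples, n_left, n_right,
              max_depth=None, min_samples_split=2, min_samples_leaf=1):
    """Return True if a candidate split passes all three growth limits."""
    if max_depth is not None and depth >= max_depth:
        return False  # node already sits at the deepest allowed level
    if n_samples < min_samples_split:
        return False  # too few samples in the node to consider a split
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # one child would be smaller than allowed
    return True


print(may_split(0, 10, 5, 5))                       # no limits hit
print(may_split(3, 10, 5, 5, max_depth=3))          # blocked by max_depth
print(may_split(1, 4, 2, 2, min_samples_split=5))   # blocked by min_samples_split
print(may_split(1, 10, 1, 9, min_samples_leaf=2))   # blocked by min_samples_leaf
```

A node that fails any one of the checks becomes a leaf, which is how these settings keep the tree from growing until it memorises the training data.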
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# 1. Unconstrained tree -> overfits
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc:", accuracy_score(y_test, model.predict(X_test)))
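# 2. Constrained tree -> better generalisation
model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
model.fit(X_train, y_train)
print("Train acc:", accuracy_score(y_train, model.predict(X_train)))
print("Test acc:", accuracy_score(y_test, model.predict(X_test)))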
Notice the training accuracy of 1.0 versus the test accuracy of 0.84 for the unconstrained tree: classic overfitting. Limiting max_depth typically narrows that gap.
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}
# cv=5 -> 5-fold cross-validation on every combination
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)
Use plot_tree(model, feature_names=[...], class_names=[...], filled=True) from sklearn.tree to draw the tree. Each box shows the split condition, the Gini value, the sample count and the predicted class, which is great for understanding what the model learned.
Impurity calculations and overfitting are frequent exam topics, so practice them here.
Decision Trees can model a relationship that goes up and then down (non-monotonic). Logistic Regression cannot, because it:
Logistic Regression's straight-line boundary assumes a monotonic relationship. Trees split into ranges, so they handle "low→0, medium→1, high→0" easily.
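A tiny one-feature experiment illustrates the point. As a simplification, a single cut-off on the feature stands in for the decision rule of a one-feature logistic regression; the data and the tree's two thresholds are made up for the example.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 0, 0]  # low -> 0, medium -> 1, high -> 0


def accuracy(predict):
    """Fraction of samples the given rule classifies correctly."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


def single_cutoff_accuracies():
    """Accuracy of every rule one cut-off can express, in both directions."""
    for t in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]:
        yield accuracy(lambda x: int(x > t))
        yield accuracy(lambda x: int(x < t))


def tree_predict(x):
    """Two nested questions carve the feature into three ranges."""
    if x < 2.5:
        return 0
    if x < 4.5:
        return 1
    return 0


best_single = max(single_cutoff_accuracies())
print("best single cut-off:", round(best_single, 2))
print("two-question tree  :", accuracy(tree_predict))
```

No single cut-off can label both ends as 0 and the middle as 1, whereas two stacked questions do it directly.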
A node where all samples belong to the same class has a Gini index of:
Gini = 1 − Σp². If one class has p = 1: Gini = 1 − (1²) = 0. A pure node has zero impurity, which is exactly the goal of splitting.
When choosing the root node, a decision tree picks the feature with the:
Information Gain measures how much entropy (disorder) a split removes. The tree greedily picks the split with the highest IG, the one that produces the purest children.
For a binary classification node, entropy is at its maximum when:
A 50/50 split is maximum uncertainty: entropy = 1.0 (binary). A pure node (all one class) has entropy 0.
An unconstrained decision tree gives 100% training accuracy but only 78% test accuracy. The best fix is:
The gap signals overfitting. Pruning hyperparameters (max_depth, min_samples_leaf, min_samples_split) limit growth and improve generalisation.
In a regression tree, what does a leaf node typically output?
Regression trees predict a continuous number: the leaf outputs the mean target value of the training samples that reached it.
Which is a genuine advantage of decision trees?
Trees need no scaling, handle mixed data types, and are white-box (easy to visualise). They do overfit easily, but that is a disadvantage, not an advantage.
A node has 6 "Yes" and 4 "No" samples. Compute its Gini index.
p(Yes) = 6/10 = 0.6, p(No) = 4/10 = 0.4. Gini = 1 − (0.6² + 0.4²) = 1 − (0.36 + 0.16) = 1 − 0.52 = 0.48.
A split produces two pure leaves: Left (all Yes) and Right (all No). What is the weighted Gini of this split, and is it a good split?
Each leaf is pure → Gini of each = 1 − (1²) = 0. The weighted average of two zeros is 0. A weighted Gini of 0 is the best possible result: the split perfectly separates the classes.
Write a Python function gini(counts) that takes a list of class counts and returns the Gini index. Test it on [3, 1].
def gini(counts):
    total = sum(counts)
    impurity = 1
    for c in counts:
        p = c / total
        impurity -= p ** 2
    return impurity
print("Gini [3,1]:", gini([3, 1]))
print("Gini [2,2]:", gini([2, 2])) # max impurity
print("Gini [4,0]:", gini([4, 0])) # pure
Train a Decision Tree classifier with max_depth=3, then print its training and test accuracy.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print("Train accuracy:", round(train_acc, 3))
print("Test accuracy :", round(test_acc, 3))
With max_depth=3 the train/test gap is typically small: the tree generalises rather than memorising.
State one advantage and one disadvantage of decision trees, and name two hyperparameters used to control overfitting.
Advantage: they are easy to interpret/visualise and need no feature scaling. Disadvantage: they overfit easily, since an unconstrained tree memorises the training data. Hyperparameters to control overfitting: max_depth (limits levels) and min_samples_leaf / min_samples_split (require enough samples before forming leaves/splits).