In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
confusion_matrix,
classification_report,
accuracy_score
)
In [2]:
df = pd.read_csv("../data/classification.csv")
df.head()
Out[2]:
|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.189853 | -2.853724 | -0.963144 | 0.318181 | 1.284353 | -1.228444 | -1.074027 | -1.831145 | 0 |
| 1 | 0.999778 | 1.436418 | 0.482381 | -1.586100 | -0.634295 | -0.393530 | 2.602929 | -1.119546 | 1 |
| 2 | 2.832780 | -0.761251 | -2.159733 | -0.235862 | -0.383240 | -0.322028 | -1.734361 | 1.614779 | 0 |
| 3 | -1.937572 | 0.665805 | -1.500188 | 2.501383 | 2.006982 | 0.359476 | 1.857416 | -0.478759 | 1 |
| 4 | -1.040299 | -1.983650 | -4.728988 | 4.836235 | 3.758709 | -1.968504 | -2.065338 | 0.034714 | 0 |
Dataset¶
The dataset contains:
- numerical features
- a binary target variable
Because Random Forest splits on per-feature thresholds rather than distances, it is insensitive to feature scale, so standardization is not required.
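As a quick illustration of this scale-invariance (on synthetic data, not the dataset above), a tree-based classifier produces the same predictions whether or not the features are rescaled by positive constants:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Multiply each feature by a different positive constant (monotonic rescaling)
X_scaled = X_demo * np.array([1.0, 100.0, 0.01, 1000.0])

clf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
clf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y_demo)

# Threshold-based splits adapt to the scale, so the fitted trees partition
# the data identically and the predictions match
same = (clf_raw.predict(X_demo) == clf_scaled.predict(X_scaled)).all()
print(same)
```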
In [3]:
X = df.drop("target", axis=1)
y = df["target"]
In [4]:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
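The `stratify=y` argument keeps the class proportions (roughly) equal in the train and test splits, which matters for imbalanced targets. A small check on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_demo = rng.choice([0, 1], size=1000, p=[0.7, 0.3])
X_demo = rng.normal(size=(1000, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in train and test mirror the full data
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```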
Model Overview¶
Random Forest is an ensemble model composed of many decision trees.
Key ideas:
- each tree is trained on a bootstrap sample of the training data
- each split considers a random subset of the features
- the final prediction is made by majority vote across the trees
This reduces overfitting compared to a single decision tree.
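These ideas can be sketched by hand on synthetic data: train several decision trees on bootstrap samples, restrict the features per split, and combine predictions by majority vote. This is a simplified version of what `RandomForestClassifier` does internally, not its exact implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] * X_demo[:, 1] > 0).astype(int)

trees = []
for seed in range(25):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" limits the features considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Majority vote across the ensemble
votes = np.stack([t.predict(X_demo) for t in trees])   # shape (25, 300)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
acc = (ensemble_pred == y_demo).mean()                 # training accuracy
print(acc)
```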
In [5]:
model = RandomForestClassifier(
n_estimators=100,
random_state=42
)
model.fit(X_train, y_train)
Out[5]:
RandomForestClassifier(random_state=42)
Prediction¶
Predictions are obtained by aggregating the predictions of all decision trees in the forest.
In [6]:
y_pred = model.predict(X_test)
Evaluation¶
Random Forest is evaluated using standard classification metrics:
- confusion matrix
- accuracy
- precision, recall and F1-score
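For a binary problem, all of these metrics follow directly from the four confusion-matrix counts. A small worked example, using the counts reported in the evaluation cell below:

```python
import numpy as np

# Confusion matrix: rows = true class, columns = predicted class
#              pred 0  pred 1
cm = np.array([[82,  1],     # true 0
               [10, 27]])    # true 1

tn, fp, fn, tp = cm.ravel()

accuracy  = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall    = tp / (tp + fn)   # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

These values match the `classification_report` row for class 1.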
In [7]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print()
print(confusion_matrix(y_test, y_pred))
print()
print(classification_report(y_test, y_pred))
Accuracy: 0.9083333333333333

[[82  1]
 [10 27]]

              precision    recall  f1-score   support

           0       0.89      0.99      0.94        83
           1       0.96      0.73      0.83        37

    accuracy                           0.91       120
   macro avg       0.93      0.86      0.88       120
weighted avg       0.91      0.91      0.90       120
Feature Importance¶
Random Forest provides an estimate of feature importance based on how much each feature contributes to reducing impurity.
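Impurity-based importances are computed on the training data and can be biased toward high-cardinality features. Permutation importance, sketched below on synthetic data, is a common cross-check: it measures how much the test score drops when one feature's values are shuffled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Score drop per feature when its values are permuted on held-out data
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```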
In [8]:
importances = pd.Series(
model.feature_importances_,
index=X.columns
).sort_values(ascending=False)
importances
Out[8]:
feature_1    0.176372
feature_4    0.170517
feature_0    0.160568
feature_6    0.154211
feature_3    0.127211
feature_2    0.112260
feature_5    0.049614
feature_7    0.049246
dtype: float64
Interpretation¶
Key characteristics of Random Forest:
- non-linear model
- robust to noise
- handles complex interactions
- less interpretable than linear models
- usually strong baseline for classification problems