Supervised Learning → Random Forest Regression¶
As with other regression models in this project, the first sections focus on data preparation and are intentionally repeated.
This ensures that each notebook can be read and used independently, without external references.
- Project setup and common pipeline
- Dataset loading
- Train-test split
- Feature scaling (pipeline consistency)
- What is this model? (Intuition)
- Model training
- Model behavior and key hyperparameters
- Predictions
- Model evaluation
- When to use it and when not to
- Model persistence
- Mathematical formulation (deep dive)
- Final summary – Code only
How this notebook should be read¶
This notebook is designed to be read top to bottom.
Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process
The goal is not just to run the code, but to understand what is happening at each step and be able to adapt it to your own data.
What is Random Forest Regression?¶
Random Forest Regression is a powerful and flexible model that works very differently from both Linear Regression and KNN.
Instead of learning a single global rule or relying on nearby data points, Random Forest builds many decision trees and combines their predictions.
Each tree:
- looks at the data in a slightly different way
- makes its own prediction
The final prediction is obtained by:
- averaging the predictions of all trees
By combining many simple models, Random Forest is able to capture complex and non-linear patterns that simpler models cannot.
Why we start with intuition¶
Random Forest can look complex at first, but the core idea is simple.
Instead of trusting a single model, Random Forest asks the same question many times, each time with a slightly different perspective.
Each tree may make mistakes, but when many trees agree, the final prediction becomes more reliable.
Understanding this idea of many weak models working together is key to understanding how Random Forest works.
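This averaging effect can be illustrated with a toy simulation, separate from the housing pipeline used in the rest of this notebook. Here each "tree" is just the true value plus independent noise; the names (`true_value`, `tree_predictions`) are illustrative, not part of the project code.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# Simulate 500 "trees": each predicts the true value plus independent noise
n_trees = 500
tree_predictions = true_value + rng.normal(0.0, 2.0, size=n_trees)

# A single noisy estimate vs. the average of many noisy estimates
single_tree_error = abs(tree_predictions[0] - true_value)
forest_error = abs(tree_predictions.mean() - true_value)

print(f"single tree error:    {single_tree_error:.3f}")
print(f"forest average error: {forest_error:.3f}")
```

The average of many independent noisy estimates is far more reliable than any single one, which is exactly the mechanism Random Forest exploits.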
What you should expect from the results¶
Before using Random Forest Regression, it is important to set expectations.
With Random Forest Regression, you should expect:
- strong performance on complex and non-linear data
- robust predictions even in the presence of noise
- less sensitivity to individual outliers
Compared to simpler models:
- Random Forest usually outperforms Linear Regression
- it is more stable than KNN on large datasets
- it requires less manual feature engineering
However, this power comes at a cost:
- the model is harder to interpret
- training and prediction are more computationally expensive
____________________________________¶
1. Project setup and common pipeline¶
In this section we set up the common pipeline used across all regression models in this project.
This part of the notebook does not depend on the model itself and is intentionally kept consistent to:
- ensure fair comparison between models
- reduce implementation errors
- focus on understanding model behavior
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
mean_absolute_error,
mean_squared_error,
r2_score
)
from pathlib import Path
import joblib
Note on feature scaling for Random Forest¶
Random Forest does not rely on distances or linear combinations of features.
For this reason, feature scaling is not required for the model to work correctly.
However, we keep feature scaling in the pipeline to:
- maintain consistency across all regression notebooks
- simplify comparisons between models
- avoid changing preprocessing steps when switching models
____________________________________¶
2. Dataset loading¶
# Load the dataset
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
Inputs and target¶
- X contains the input features
- y contains the continuous target variable
At this stage:
- no modeling is performed
- we are only defining what the model will learn from
____________________________________¶
3. Train-test split¶
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
Why this step is important¶
A model should not be evaluated on the same data it was trained on.
By separating the data:
- the training set is used to build the model
- the test set is used only for evaluation
This ensures that performance metrics reflect real-world behavior.
Consistency across models¶
We use the same split configuration as in the other regression notebooks.
This guarantees that performance differences are due to the model itself and not to different data splits.
____________________________________¶
4. Feature scaling (pipeline consistency)¶
In this section we apply feature scaling to the input features.
Although Random Forest does not rely on distances or linear combinations of features, we keep feature scaling in the pipeline to maintain consistency across models.
Is feature scaling required for Random Forest?¶
Strictly speaking:
- Random Forest does not require feature scaling
Decision trees:
- split features based on thresholds
- are invariant to feature scale
However, scaling is still applied here for consistency and comparability.
Why we keep the same pipeline across models¶
In practice, once a problem is identified as a regression task, it is common to try multiple models and compare their performance on the same data.
For this reason:
- the dataset remains identical
- the train-test split remains identical
- the preprocessing pipeline remains identical
Keeping the same pipeline allows us to:
- compare models fairly
- attribute performance differences to the model itself
- switch models without changing data preparation steps
Even if a model does not strictly require scaling, including it in the pipeline ensures consistency and simplifies model comparison.
Why we still apply scaling¶
We apply feature scaling to:
- keep the preprocessing pipeline identical
- simplify switching between models
- avoid conditional logic in the code
This makes the project easier to maintain and easier to extend.
Important rule: fit only on training data¶
As with other preprocessing steps:
- the scaler is fitted on training data only
- the same scaler is applied to test data
This prevents data leakage and ensures a fair evaluation.
# Feature scaling (kept for pipeline consistency)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
What we have after this step¶
- scaled training data
- scaled test data
- a complete and consistent preprocessing pipeline
At this point, the data is ready to be used by the Random Forest model.
____________________________________¶
5. What is this model? (Random Forest Regression)¶
Before training the model, it is important to understand what Random Forest Regression is trying to do conceptually.
Random Forest Regression is an ensemble model, meaning it combines the predictions of many simpler models to produce a final result.
The core idea¶
Instead of relying on a single model, Random Forest builds many decision trees.
Each tree:
- is trained on a slightly different subset of the data
- looks at a different subset of features
- makes its own prediction
The final prediction is obtained by averaging the predictions of all trees.
Why using many trees helps¶
Individual decision trees:
- are easy to understand
- but tend to overfit the data
Random Forest reduces this problem by:
- training many trees
- making them as different as possible
- combining their predictions
Errors made by individual trees tend to cancel out when averaged.
Key takeaway¶
Random Forest Regression does not try to find a single rule.
Instead, it asks: "What would many different decision trees predict?"
By combining these answers, it produces robust and flexible predictions on complex regression problems.
____________________________________¶
6. Model training¶
Important hyperparameters (introduced, not tuned)¶
At this stage we focus on understanding the model, not on optimizing it.
We start with a simple configuration and default values. Hyperparameter tuning can be explored later.
from sklearn.ensemble import RandomForestRegressor
# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(
n_estimators=100,
random_state=42,
n_jobs=-1
)
# Train the model
rf_model.fit(X_train_scaled, y_train)
RandomForestRegressor(n_jobs=-1, random_state=42)
What these parameters mean¶
n_estimators=100
Number of decision trees in the forest. More trees usually improve stability, but increase computation.

random_state=42
Ensures reproducibility of results.

n_jobs=-1
Uses all available CPU cores to speed up training.
What we have after training¶
After this step:
- multiple decision trees have been trained
- each tree has learned different patterns
- the forest is ready to make predictions
Unlike KNN:
- training is computationally heavier
- prediction is relatively fast
____________________________________¶
7. Model behavior and key hyperparameters (Random Forest Regression)¶
In this section we describe how Random Forest behaves and which hyperparameters most strongly influence its predictions.
Random Forest does not produce simple, interpretable parameters, but its behavior can still be understood at a high level.
Number of trees (n_estimators)¶
The number of trees controls:
- model stability
- variance reduction
General behavior:
- few trees → higher variance, less stable predictions
- many trees → more stable, smoother predictions
Beyond a certain point, adding more trees provides diminishing returns.
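The effect of `n_estimators` can be sketched on a small synthetic dataset (using `make_regression` as a fast stand-in for the housing data; the names `X_demo`, `scores`, etc. are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Small synthetic regression problem, for speed
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Test-set R² for forests of increasing size
scores = {}
for n in (1, 10, 50, 100):
    rf = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(Xtr, ytr)
    scores[n] = rf.score(Xte, yte)

for n, r2 in scores.items():
    print(f"n_estimators={n:>3}  R²={r2:.3f}")
```

Typically the score improves sharply from 1 to 10 trees and then flattens out, which is the diminishing-returns pattern described above.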
Tree depth and complexity¶
Each tree in the forest can grow deep and complex.
Key parameters that control tree complexity include:
- maximum tree depth
- minimum number of samples per split
- minimum number of samples per leaf
More complex trees:
- capture fine details
- risk overfitting
Simpler trees:
- generalize better
- may underfit
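These complexity controls map to the `max_depth`, `min_samples_split`, and `min_samples_leaf` parameters of `RandomForestRegressor`. A minimal sketch on synthetic data (names like `shallow_rf` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Constrain every tree in the forest to stay simple
shallow_rf = RandomForestRegressor(
    n_estimators=50,
    max_depth=4,           # cap tree depth at 4 levels
    min_samples_split=10,  # a node needs 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
)
shallow_rf.fit(X_demo, y_demo)

# Verify the constraint: no fitted tree exceeds the depth cap
depths = [tree.get_depth() for tree in shallow_rf.estimators_]
print(max(depths))
```

By default these limits are off and trees grow until leaves are (nearly) pure, which is why unconstrained forests can use substantial memory.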
Key takeaway¶
Random Forest behavior is controlled by:
- how many trees are used
- how complex each tree is
- how much randomness is introduced
Understanding these elements helps explain why Random Forest often performs well without extensive tuning.
____________________________________¶
8. Predictions¶
# Generate predictions on the test set
y_pred_rf = rf_model.predict(X_test_scaled)
What we obtained¶
- y_pred_rf contains the predicted target values
- predictions are generally smooth and stable
- extreme values are handled more robustly than with simpler models
____________________________________¶
9. Model evaluation (Random Forest Regression)¶
In this section we evaluate the performance of the Random Forest Regression model by comparing its predictions with the true target values.
Using the same evaluation metrics across models allows direct and fair comparison.
# Compute evaluation metrics for Random Forest Regression
mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)
mae, mse, rmse, r2
(0.3274252027374033, 0.255169737347244, np.float64(0.5051432839771741), 0.8052747336256919)
Metrics interpretation¶
MAE (Mean Absolute Error)
Average absolute difference between predictions and true values.

MSE (Mean Squared Error)
Penalizes large errors more heavily.

RMSE (Root Mean Squared Error)
Expresses prediction error in the same units as the target variable and is often the most informative metric for model comparison.

R² score
Indicates how much variance in the target variable is explained by the model.
What to expect from Random Forest results¶
With Random Forest Regression, you will often observe:
- lower RMSE compared to simpler models
- better handling of non-linear patterns
- improved robustness to noise and outliers
However:
- improvements come at the cost of interpretability
- training time and memory usage are higher
____________________________________¶
10. When to use it and when not to¶
When Random Forest Regression is a good choice¶
Random Forest Regression works well when:
- The relationship between features and target is non-linear
- The data contains complex interactions between features
- The dataset is of small to medium size
- Robust performance is more important than interpretability
- You want strong results without extensive feature engineering
In many real-world problems, Random Forest provides a strong balance between accuracy and reliability.
When Random Forest Regression is NOT a good choice¶
Random Forest Regression may not be ideal when:
- Interpretability is a strict requirement
- The dataset is extremely large
- Memory usage is a concern
- Very fast prediction time is required
- A simple linear relationship already explains the data well
In these cases, simpler or more specialized models may be preferable.
Typical warning signs¶
You should be cautious if:
- Training time becomes very long
- The model uses excessive memory
- Performance gains over simpler models are marginal
- Model behavior is difficult to explain to stakeholders
These signals suggest that Random Forest may not be the most efficient choice.
____________________________________¶
11. Model persistence¶
Why saving the model is important¶
Training a Random Forest can be computationally expensive, especially when many trees are used.
Once the model has been trained and evaluated, it is common practice to save it and reuse it later:
- in another notebook
- in an application
- in a production environment
Important rule: save the scaler together with the model¶
Even though Random Forest does not require feature scaling, we still save the scaler.
This ensures that:
- the same preprocessing pipeline is applied
- models can be swapped without changing the input pipeline
- results remain consistent across experiments
# Define model directory
model_dir = Path("models/supervised_learning/regression/random_forest_regression")
# Create directory if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)
# Save model and scaler
joblib.dump(rf_model, model_dir / "random_forest_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")
['models\\supervised_learning\\regression\\random_forest_regression\\scaler.joblib']
Loading the model later (conceptual example)¶
To reuse the model:
- load the scaler
- scale new input data
- load the Random Forest model
- generate predictions
This guarantees consistency with the original training pipeline.
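The steps above can be sketched as a self-contained round trip. To keep the example runnable on its own, it fits a tiny stand-in model and saves to a temporary directory; the names (`X_small`, `model_dir`, the file names) are illustrative, not the paths used earlier in this notebook.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Tiny stand-in data so the round trip is self-contained
rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 3))
y_small = X_small @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 100)

scaler = StandardScaler().fit(X_small)
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(scaler.transform(X_small), y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    joblib.dump(model, model_dir / "model.joblib")
    joblib.dump(scaler, model_dir / "scaler.joblib")

    # Reuse later: load the scaler first, transform, then predict
    loaded_scaler = joblib.load(model_dir / "scaler.joblib")
    loaded_model = joblib.load(model_dir / "model.joblib")
    preds = loaded_model.predict(loaded_scaler.transform(X_small))

# The reloaded pipeline reproduces the original predictions
print(np.allclose(preds, model.predict(scaler.transform(X_small))))
```

The key point is the order: load the scaler, transform the new data, then call `predict` on the loaded model.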
____________________________________¶
12. Mathematical formulation (deep dive)¶
From decision trees to Random Forest¶
Random Forest Regression is built on top of decision trees.
A single decision tree:
- recursively splits the feature space
- creates regions where predictions are constant
- predicts the average target value in each region
While individual trees are simple, they tend to overfit the training data.
Introducing randomness¶
Random Forest reduces overfitting by introducing randomness in two main ways:
Bootstrap sampling
Each tree is trained on a random subset of the training data, sampled with replacement.

Feature subsampling
At each split, only a random subset of features is considered.
These two sources of randomness make individual trees less correlated.
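Both mechanisms reduce, at their core, to random index draws. A minimal NumPy sketch of what the library does internally (sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 10, 6

# Bootstrap sample: draw row indices WITH replacement,
# so some rows repeat and others are left out entirely
bootstrap_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Feature subsampling: at a given split, consider only a
# random subset of the features (here, a third of them)
n_candidates = max(1, n_features // 3)
feature_subset = rng.choice(n_features, size=n_candidates, replace=False)

print(sorted(bootstrap_idx.tolist()))
print(feature_subset.tolist())
```

Each tree gets its own bootstrap sample, and the feature draw is repeated independently at every split, which is what decorrelates the trees.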
Ensemble prediction¶
Let each tree produce a prediction ŷᵢ(x) for an input x.
The Random Forest prediction is computed as:
- the average of all tree predictions
This averaging process:
- reduces variance
- stabilizes predictions
- improves generalization
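For regression, scikit-learn's forest prediction is exactly this average over the fitted trees in `estimators_`, which we can verify directly (using `make_regression` as a small stand-in dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=25, random_state=0)
rf.fit(X_demo, y_demo)

# ŷ(x) = mean over i of ŷᵢ(x): average the individual tree predictions
per_tree = np.stack([tree.predict(X_demo) for tree in rf.estimators_])
manual_avg = per_tree.mean(axis=0)

print(np.allclose(manual_avg, rf.predict(X_demo)))
```

Averaging over `axis=0` collapses the 25 per-tree prediction vectors into a single vector, matching `rf.predict`.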
Bias–variance perspective¶
Random Forest primarily addresses variance.
- Individual trees have low bias but high variance
- Averaging many trees keeps bias low
- Variance is reduced through aggregation
This explains why Random Forest often performs well without heavy tuning.
Why scaling does not affect the math¶
Decision trees split data based on feature thresholds.
Because these splits depend on order, not magnitude:
- feature scaling does not change split decisions
- the mathematical behavior of the model remains unchanged
This explains why Random Forest does not require scaling, even though we include it for pipeline consistency.
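This invariance can be demonstrated empirically: with the same `random_state`, a forest trained on raw features and one trained on standardized features should produce matching predictions, because standardization preserves the ordering of each feature's values. A sketch on synthetic data (`X_demo` and friends are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Same random_state → same bootstrap samples and feature draws;
# scaling preserves value ORDER, so the split partitions match too
rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_scaled, y_demo)

print(np.allclose(rf_raw.predict(X_demo), rf_scaled.predict(X_scaled)))
```

Contrast this with distance-based models like KNN, where rescaling a feature directly changes which neighbors are closest.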
Final takeaway¶
Random Forest Regression replaces a single complex model with an ensemble of simple ones.
By combining:
- randomness
- averaging
- independent decision trees
it achieves strong performance on complex, non-linear problems, while remaining conceptually grounded and robust.
____________________________________¶
Final summary – Code only¶
The following cell contains the complete pipeline from data loading to model persistence.
No explanations are provided here on purpose.
This section is intended for:
- quick execution
- reference
- reuse in scripts or applications
If you want to understand what each step does and why, read the notebook from top to bottom.
# ====================================
# Imports
# ====================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from pathlib import Path
import joblib
# ====================================
# Dataset loading
# ====================================
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
# ====================================
# Train-test split
# ====================================
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
# ====================================
# Feature scaling (pipeline consistency)
# ====================================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ====================================
# Model initialization
# ====================================
rf_model = RandomForestRegressor(
n_estimators=100,
random_state=42,
n_jobs=-1
)
# ====================================
# Model training
# ====================================
rf_model.fit(X_train_scaled, y_train)
# ====================================
# Predictions
# ====================================
y_pred_rf = rf_model.predict(X_test_scaled)
# ====================================
# Model evaluation
# ====================================
mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)
mae, mse, rmse, r2
# ====================================
# Model persistence
# ====================================
model_dir = Path("models/supervised_learning/regression/random_forest_regression")
model_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(rf_model, model_dir / "random_forest_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")