Supervised Learning → Random Forest Regression¶

As with other regression models in this project, the first sections focus on data preparation and are intentionally repeated.

This ensures that each notebook can be read and used independently, without external references.

  1. Project setup and common pipeline
  2. Dataset loading
  3. Train-test split
  4. Feature scaling (pipeline consistency)

  5. What is this model? (Intuition)
  6. Model training
  7. Model behavior and key hyperparameters
  8. Predictions
  9. Model evaluation
  10. When to use it and when not to
  11. Model persistence
  12. Mathematical formulation (deep dive)
  Final summary – Code only

How this notebook should be read¶

This notebook is designed to be read top to bottom.

Before every code cell, you will find a short explanation describing:

  • what we are about to do
  • why this step is necessary
  • how it fits into the overall process

The goal is not just to run the code, but to understand what is happening at each step and be able to adapt it to your own data.


What is Random Forest Regression?¶

Random Forest Regression is a powerful and flexible model that works very differently from both Linear Regression and KNN.

Instead of learning a single global rule or relying on nearby data points, Random Forest builds many decision trees and combines their predictions.

Each tree:

  • looks at the data in a slightly different way
  • makes its own prediction

The final prediction is obtained by:

  • averaging the predictions of all trees

By combining many simple models, Random Forest is able to capture complex and non-linear patterns that simpler models cannot.


Why we start with intuition¶

Random Forest can look complex at first, but the core idea is simple.

Instead of trusting a single model, Random Forest asks the same question many times, each time with a slightly different perspective.

Each tree may make mistakes, but when many trees agree, the final prediction becomes more reliable.

Understanding this idea of many weak models working together is key to understanding how Random Forest works.


What you should expect from the results¶

Before using Random Forest Regression, it is important to set expectations.

With Random Forest Regression, you should expect:

  • strong performance on complex and non-linear data
  • robust predictions even in the presence of noise
  • less sensitivity to individual outliers

Compared to simpler models:

  • Random Forest usually outperforms Linear Regression
  • it is more stable than KNN on large datasets
  • it requires less manual feature engineering

However, this power comes at a cost:

  • the model is harder to interpret
  • training and prediction are more computationally expensive

____________________________________¶

1. Project setup and common pipeline¶

In this section we set up the common pipeline used across all regression models in this project.

This part of the notebook does not depend on the model itself and is intentionally kept consistent to:

  • ensure fair comparison between models
  • reduce implementation errors
  • focus on understanding model behavior
In [11]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score
)

from pathlib import Path
import joblib

Note on feature scaling for Random Forest¶

Random Forest does not rely on distances or linear combinations of features.

For this reason, feature scaling is not required for the model to work correctly.

However, we keep feature scaling in the pipeline to:

  • maintain consistency across all regression notebooks
  • simplify comparisons between models
  • avoid changing preprocessing steps when switching models

____________________________________¶

2. Dataset loading¶

In this section we load the dataset that will be used to train and evaluate the Random Forest Regression model.

We use the same dataset as in the other regression notebooks to allow direct comparison between different models.

In [12]:
# Load the dataset

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

Inputs and target¶

  • X contains the input features
  • y contains the continuous target variable

At this stage:

  • no modeling is performed
  • we are only defining what the model will learn from

____________________________________¶

3. Train-test split¶

In this section we split the dataset into training and test sets.

This allows us to evaluate how well the model generalizes to unseen data.

In [13]:
# Split data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Why this step is important¶

A model should not be evaluated on the same data it was trained on.

By separating the data:

  • the training set is used to build the model
  • the test set is used only for evaluation

This ensures that performance metrics reflect real-world behavior.

Consistency across models¶

We use the same split configuration as in the other regression notebooks.

This guarantees that performance differences are due to the model itself and not to different data splits.

____________________________________¶

4. Feature scaling (pipeline consistency)¶

In this section we apply feature scaling to the input features.

Although Random Forest does not rely on distances or linear combinations of features, we keep feature scaling in the pipeline to maintain consistency across models.

Is feature scaling required for Random Forest?¶

Strictly speaking:

  • Random Forest does not require feature scaling

Decision trees:

  • split features based on thresholds
  • are invariant to feature scale

However, scaling is still applied here for consistency and comparability.

Why we keep the same pipeline across models¶

In practice, once a problem is identified as a regression task, it is common to try multiple models and compare their performance on the same data.

For this reason:

  • the dataset remains identical
  • the train-test split remains identical
  • the preprocessing pipeline remains identical

Keeping the same pipeline allows us to:

  • compare models fairly
  • attribute performance differences to the model itself
  • switch models without changing data preparation steps

Even if a model does not strictly require scaling, including it in the pipeline ensures consistency and simplifies model comparison.

Why we still apply scaling¶

We apply feature scaling to:

  • keep the preprocessing pipeline identical
  • simplify switching between models
  • avoid conditional logic in the code

This makes the project easier to maintain and easier to extend.

Important rule: fit only on training data¶

As with other preprocessing steps:

  • the scaler is fitted on training data only
  • the same scaler is applied to test data

This prevents data leakage and ensures a fair evaluation.

In [14]:
# Feature scaling (kept for pipeline consistency)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

What we have after this step¶

  • scaled training data
  • scaled test data
  • a complete and consistent preprocessing pipeline

At this point, the data is ready to be used by the Random Forest model.
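
As a quick, optional check of the claim above, the cell below trains two small forests, one on the raw features and one on the scaled features, and compares their predictions. This is only an illustrative sketch (the small n_estimators=20 just keeps it fast); because trees split on thresholds and standardization preserves the ordering of values, the two sets of predictions should be essentially identical.

In [ ]:
# Optional check: Random Forest predictions should not change when features are scaled
from sklearn.ensemble import RandomForestRegressor

rf_raw = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
rf_std = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)

rf_raw.fit(X_train, y_train)            # raw (unscaled) features
rf_std.fit(X_train_scaled, y_train)     # standardized features

pred_raw = rf_raw.predict(X_test)
pred_std = rf_std.predict(X_test_scaled)

# Expect a very small difference (floating-point noise at split thresholds)
print(np.max(np.abs(pred_raw - pred_std)))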

____________________________________¶

5. What is this model? (Random Forest Regression)¶

Before training the model, it is important to understand what Random Forest Regression is trying to do conceptually.

Random Forest Regression is an ensemble model, meaning it combines the predictions of many simpler models to produce a final result.

The core idea¶

Instead of relying on a single model, Random Forest builds many decision trees.

Each tree:

  • is trained on a slightly different subset of the data
  • looks at a different subset of features
  • makes its own prediction

The final prediction is obtained by averaging the predictions of all trees.

Why using many trees helps¶

Individual decision trees:

  • are easy to understand
  • but tend to overfit the data

Random Forest reduces this problem by:

  • training many trees
  • making them as different as possible
  • combining their predictions

Errors made by individual trees tend to cancel out when averaged.

Key takeaway¶

Random Forest Regression does not try to find a single rule.

Instead, it asks: "What would many different decision trees predict?"

By combining these answers, it produces robust and flexible predictions on complex regression problems.

____________________________________¶

6. Model training (Random Forest Regression)¶

In this section we train the Random Forest Regression model.

Unlike KNN, Random Forest performs real training: it builds multiple decision trees and learns patterns from the data.

Important hyperparameters (introduced, not tuned)¶

At this stage we focus on understanding the model, not on optimizing it.

We start with a simple configuration and default values. Hyperparameter tuning can be explored later.

In [15]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

# Train the model
rf_model.fit(X_train_scaled, y_train)
Out[15]:
RandomForestRegressor(n_jobs=-1, random_state=42)

What these parameters mean¶

  • n_estimators=100
    Number of decision trees in the forest.
    More trees usually improve stability, but increase computation.

  • random_state=42
    Ensures reproducibility of results.

  • n_jobs=-1
    Uses all available CPU cores to speed up training.

What we have after training¶

After this step:

  • multiple decision trees have been trained
  • each tree has learned different patterns
  • the forest is ready to make predictions

Unlike KNN:

  • training is computationally heavier
  • prediction is relatively fast
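
Now that the forest is trained, we can also verify the averaging idea described in the previous section. The optional cell below (assuming the rf_model and X_test_scaled defined above) manually averages the predictions of the individual trees exposed in estimators_ and checks that the result matches what predict returns.

In [ ]:
# Sanity check: the forest prediction is the average of the individual tree predictions
tree_predictions = np.stack([tree.predict(X_test_scaled) for tree in rf_model.estimators_])

manual_average = tree_predictions.mean(axis=0)
forest_prediction = rf_model.predict(X_test_scaled)

print(np.allclose(manual_average, forest_prediction))  # expected: True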

____________________________________¶

7. Model behavior and key hyperparameters (Random Forest Regression)¶

In this section we describe how Random Forest behaves and which hyperparameters most strongly influence its predictions.

Random Forest does not produce simple, interpretable parameters, but its behavior can still be understood at a high level.

Number of trees (n_estimators)¶

The number of trees controls:

  • model stability
  • variance reduction

General behavior:

  • few trees → higher variance, less stable predictions
  • many trees → more stable, smoother predictions

Beyond a certain point, adding more trees provides diminishing returns.
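
To see this effect concretely, the optional sketch below (assuming the training and test arrays from the earlier cells) grows a single forest with warm_start=True and prints the test RMSE as trees are added. The exact numbers will vary, but the error typically stops improving noticeably well before the last step.

In [ ]:
# Optional illustration: test RMSE as the number of trees grows
rf_growing = RandomForestRegressor(
    n_estimators=10,
    warm_start=True,   # keep already-trained trees and add new ones on each fit
    random_state=42,
    n_jobs=-1
)

for n_trees in [10, 50, 100, 200]:
    rf_growing.n_estimators = n_trees
    rf_growing.fit(X_train_scaled, y_train)
    rmse_n = np.sqrt(mean_squared_error(y_test, rf_growing.predict(X_test_scaled)))
    print(f"{n_trees:>4} trees -> test RMSE: {rmse_n:.3f}")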

Tree depth and complexity¶

Each tree in the forest can grow deep and complex.

Key parameters that control tree complexity include:

  • maximum tree depth
  • minimum number of samples per split
  • minimum number of samples per leaf

More complex trees:

  • capture fine details
  • risk overfitting

Simpler trees:

  • generalize better
  • may underfit
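
The optional sketch below makes this trade-off concrete (assuming the arrays from the earlier cells): it compares strongly constrained forests with the default, unconstrained configuration on training and test RMSE. The specific parameter values are illustrative only.

In [ ]:
# Optional illustration: constrained vs. unconstrained trees
for params in [{"max_depth": 4}, {"min_samples_leaf": 50}, {}]:
    rf_tmp = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1, **params)
    rf_tmp.fit(X_train_scaled, y_train)

    rmse_train = np.sqrt(mean_squared_error(y_train, rf_tmp.predict(X_train_scaled)))
    rmse_test = np.sqrt(mean_squared_error(y_test, rf_tmp.predict(X_test_scaled)))

    print(f"{params or 'defaults'} -> train RMSE: {rmse_train:.3f}, test RMSE: {rmse_test:.3f}")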

Key takeaway¶

Random Forest behavior is controlled by:

  • how many trees are used
  • how complex each tree is
  • how much randomness is introduced

Understanding these elements helps explain why Random Forest often performs well without extensive tuning.

____________________________________¶

8. Predictions (Random Forest Regression)¶

In this section we use the trained Random Forest model to generate predictions on unseen data.

This step shows how the model behaves in practice after the training phase is complete.

In [16]:
# Generate predictions on the test set

y_pred_rf = rf_model.predict(X_test_scaled)

What we obtained¶

  • y_pred_rf contains the predicted target values
  • predictions are generally smooth and stable
  • extreme target values are not extrapolated: each prediction is an average of training target values, so it tends to stay within the range seen during training
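
For a quick sanity check, the optional cell below puts a few predictions next to the corresponding true target values (it assumes y_test and y_pred_rf from the cells above and is purely illustrative).

In [ ]:
# Peek at a few predictions next to the true target values
comparison = pd.DataFrame({
    "y_true": y_test.iloc[:5].to_numpy(),
    "y_pred_rf": y_pred_rf[:5]
})
comparison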

____________________________________¶

9. Model evaluation (Random Forest Regression)¶

In this section we evaluate the performance of the Random Forest Regression model by comparing its predictions with the true target values.

Using the same evaluation metrics across models allows direct and fair comparison.

In [17]:
# Compute evaluation metrics for Random Forest Regression

mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)

mae, mse, rmse, r2
Out[17]:
(0.3274252027374033,
 0.255169737347244,
 np.float64(0.5051432839771741),
 0.8052747336256919)

Metrics interpretation¶

  • MAE (Mean Absolute Error)
    Average absolute difference between predictions and true values.

  • MSE (Mean Squared Error)
    Penalizes large errors more heavily.

  • RMSE (Root Mean Squared Error)
    Expresses prediction error in the same units as the target variable and is often the most informative metric for model comparison.

  • R² score
    Indicates how much variance in the target variable is explained by the model.
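
For reference, with $y_i$ the true values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the true values and $n$ the number of test samples, these metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$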

What to expect from Random Forest results¶

With Random Forest Regression, you will often observe:

  • lower RMSE compared to simpler models
  • better handling of non-linear patterns
  • improved robustness to noise and outliers

However:

  • improvements come at the cost of interpretability
  • training time and memory usage are higher

____________________________________¶

10. When to use it and when not to (Random Forest Regression)¶

Random Forest Regression is often a strong default choice, but it is not always the best solution.

Understanding when to use it helps avoid unnecessary complexity and inefficiency.

When Random Forest Regression is a good choice¶

Random Forest Regression works well when:

  • The relationship between features and target is non-linear
  • The data contains complex interactions between features
  • The dataset is of small to medium size
  • Robust performance is more important than interpretability
  • You want strong results without extensive feature engineering

In many real-world problems, Random Forest provides a strong balance between accuracy and reliability.

When Random Forest Regression is NOT a good choice¶

Random Forest Regression may not be ideal when:

  • Interpretability is a strict requirement
  • The dataset is extremely large
  • Memory usage is a concern
  • Very fast prediction time is required
  • A simple linear relationship already explains the data well

In these cases, simpler or more specialized models may be preferable.

Typical warning signs¶

You should be cautious if:

  • Training time becomes very long
  • The model uses excessive memory
  • Performance gains over simpler models are marginal
  • Model behavior is difficult to explain to stakeholders

These signals suggest that Random Forest may not be the most efficient choice.

____________________________________¶

11. Model persistence (Random Forest Regression)¶

In this section we save the trained Random Forest model and the preprocessing steps used during training.

Saving the model allows us to reuse it without retraining and ensures reproducibility.

Why saving the model is important¶

Training a Random Forest can be computationally expensive, especially when many trees are used.

Once the model has been trained and evaluated, it is common practice to save it and reuse it later:

  • in another notebook
  • in an application
  • in a production environment

Important rule: save the scaler together with the model¶

Even though Random Forest does not require feature scaling, we still save the scaler.

This ensures that:

  • the same preprocessing pipeline is applied
  • models can be swapped without changing the input pipeline
  • results remain consistent across experiments
In [18]:
# Define model directory
model_dir = Path("models/supervised_learning/regression/random_forest_regression")

# Create directory if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)

# Save model and scaler
joblib.dump(rf_model, model_dir / "random_forest_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")
Out[18]:
['models\\supervised_learning\\regression\\random_forest_regression\\scaler.joblib']

Loading the model later (conceptual example)¶

To reuse the model:

  • load the scaler
  • scale new input data
  • load the Random Forest model
  • generate predictions

This guarantees consistency with the original training pipeline.
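
A minimal loading sketch is shown below. It assumes the same model_dir used when saving, and a hypothetical DataFrame X_new containing new samples with the original feature columns.

In [ ]:
# Conceptual example: reload the saved artifacts and predict on new data
from pathlib import Path
import joblib

model_dir = Path("models/supervised_learning/regression/random_forest_regression")

loaded_scaler = joblib.load(model_dir / "scaler.joblib")
loaded_model = joblib.load(model_dir / "random_forest_regression_model.joblib")

# X_new is a placeholder for new, unseen input data with the original feature columns
X_new_scaled = loaded_scaler.transform(X_new)
y_new_pred = loaded_model.predict(X_new_scaled)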

____________________________________¶

12. Mathematical formulation (deep dive)¶

This section provides a deeper explanation of Random Forest Regression from a mathematical and algorithmic perspective.

The goal is to understand the principles behind the model, not to derive every formula in detail.

From decision trees to Random Forest¶

Random Forest Regression is built on top of decision trees.

A single decision tree:

  • recursively splits the feature space
  • creates regions where predictions are constant
  • predicts the average target value in each region

While individual trees are simple, they tend to overfit the training data.

Introducing randomness¶

Random Forest reduces overfitting by introducing randomness in two main ways:

  1. Bootstrap sampling
    Each tree is trained on a random subset of the training data sampled with replacement.

  2. Feature subsampling
    At each split, only a random subset of features is considered.

These two sources of randomness make individual trees less correlated.
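
The bootstrap idea itself is easy to see in isolation. The small sketch below is independent of the trained model and uses a generic NumPy random generator (not the forest's internal sampler): it draws one bootstrap sample of row indices and shows that roughly 63% of the original rows appear in it, which is why each tree sees a slightly different dataset.

In [ ]:
# Tiny illustration of bootstrap sampling (with replacement)
rng = np.random.default_rng(0)
n_samples = len(X_train)

bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = len(np.unique(bootstrap_idx)) / n_samples

print(f"Unique training rows in one bootstrap sample: {unique_fraction:.1%}")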

Ensemble prediction¶

Let each tree produce a prediction ŷᵢ(x) for an input x.

The Random Forest prediction is computed as:

  • the average of all tree predictions
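
Written out, if the forest contains $T$ trees and $\hat{y}_i(x)$ is the prediction of tree $i$:

$$\hat{y}_{\mathrm{RF}}(x) = \frac{1}{T}\sum_{i=1}^{T}\hat{y}_i(x)$$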

This averaging process:

  • reduces variance
  • stabilizes predictions
  • improves generalization

Bias–variance perspective¶

Random Forest primarily addresses variance.

  • Individual trees have low bias but high variance
  • Averaging many trees keeps bias low
  • Variance is reduced through aggregation

This explains why Random Forest often performs well without heavy tuning.
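
A standard way to make this precise: if each tree's prediction has variance $\sigma^2$ and the average pairwise correlation between trees is $\rho$, the variance of the average of $T$ trees is

$$\rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2$$

Adding trees shrinks the second term, while bootstrap sampling and feature subsampling lower $\rho$ and therefore the first term.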

Why scaling does not affect the math¶

Decision trees split data based on feature thresholds.

Because these splits depend on order, not magnitude:

  • feature scaling does not change split decisions
  • the mathematical behavior of the model remains unchanged

This explains why Random Forest does not require scaling, even though we include it for pipeline consistency.

Final takeaway¶

Random Forest Regression replaces a single complex model with an ensemble of simple ones.

By combining:

  • randomness
  • averaging
  • independent decision trees

it achieves strong performance on complex, non-linear problems, while remaining conceptually grounded and robust.

____________________________________¶

Final summary – Code only¶

The following cell contains the complete pipeline from data loading to model persistence.

No explanations are provided here on purpose.

This section is intended for:

  • quick execution
  • reference
  • reuse in scripts or applications

If you want to understand what each step does and why, read the notebook from top to bottom.

In [ ]:
# ====================================
# Imports
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from pathlib import Path
import joblib


# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


# ====================================
# Feature scaling (pipeline consistency)
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# ====================================
# Model initialization
# ====================================

rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)


# ====================================
# Model training
# ====================================

rf_model.fit(X_train_scaled, y_train)


# ====================================
# Predictions
# ====================================

y_pred_rf = rf_model.predict(X_test_scaled)


# ====================================
# Model evaluation
# ====================================

mae = mean_absolute_error(y_test, y_pred_rf)
mse = mean_squared_error(y_test, y_pred_rf)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_rf)

mae, mse, rmse, r2


# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/random_forest_regression")
model_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(rf_model, model_dir / "random_forest_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")