Deep Learning – Regression (scikit-learn)¶

This notebook is part of the ML-Methods project.

It introduces Deep Learning for supervised regression using the scikit-learn implementation of neural networks.

As with all other notebooks in this project, the initial sections focus on data preparation and are intentionally repeated.

This ensures:

  • conceptual consistency
  • fair comparison across models
  • a unified learning pipeline

Notebook Roadmap (standard ML-Methods)¶

  1. Project setup and common pipeline
  2. Dataset loading
  3. Train-test split
  4. Feature scaling (why we do it)

  5. What is this model? (Intuition)
  6. Model training
  7. Model behavior and key parameters
  8. Predictions
  9. Model evaluation
  10. When to use it and when not to
  11. Model persistence
  12. Mathematical formulation (deep dive)
  13. Final summary – Code only

How this notebook should be read¶

This notebook is designed to be read top to bottom.

Before every code cell, you will find a short explanation describing:

  • what we are about to do
  • why this step is necessary
  • how it fits into the overall process

The goal is not just to run the code, but to understand how deep learning regression differs from classical regression models and how it fits into the supervised learning pipeline.


What is Deep Learning (in this context)?¶

In this notebook, Deep Learning refers to neural networks with multiple layers used to solve regression problems.

Unlike classification:

  • the target is a continuous value
  • there are no class labels
  • the model predicts a real number

The neural network learns a function:

input features → continuous output


What do we want to achieve?¶

Our objective is to train a model that:

  • receives a vector of numerical features
  • processes them through multiple layers
  • outputs a single continuous value

The model learns how combinations of input features map to a numerical target.

This is useful when:

  • relationships are non-linear
  • classical linear regression is insufficient
  • feature interactions are complex

Why use scikit-learn for Deep Learning regression?¶

scikit-learn provides a high-level abstraction for neural networks through MLPRegressor.

This allows us to:

  • reuse the same ML pipeline as classical models
  • focus on concepts rather than low-level training details
  • understand what deep learning regression does before implementing it manually

This notebook acts as a bridge between classical regression and full deep learning frameworks such as PyTorch and TensorFlow.


What you should expect from the results¶

With Deep Learning (scikit-learn regression), you should expect:

  • ability to model non-linear relationships
  • improved performance on complex patterns
  • sensitivity to feature scaling
  • longer training times than linear models

However:

  • interpretability is low
  • hyperparameter tuning is important
  • the model behaves as a black box

1. Project setup and common pipeline¶

In this section we set up the common pipeline used across regression models in this project.

Although this notebook uses a neural network, the data preparation steps remain identical to other regression approaches.

In [2]:
# ====================================
# Common imports used across regression models
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt

What changes compared to classical regression¶

Compared to linear regression:

  • the target remains continuous
  • the pipeline remains identical
  • the model becomes non-linear

The main difference lies in the model itself, not in the surrounding workflow.

In the next section, we will load the dataset used for the regression task.


2. Dataset loading¶

In this section we load the dataset used for the deep learning regression task.

We use a regression-specific dataset with a continuous target variable, which allows us to evaluate how neural networks behave when predicting real-valued outputs.

In [3]:
# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

Inputs and target¶

  • X contains the input features
  • y contains the target variable

This is a supervised regression problem:

  • each sample has a continuous target value
  • the goal is to predict a real number, not a class

Why this dataset¶

The California Housing dataset is well suited for regression because:

  • it contains numerical features
  • relationships are non-linear
  • target values are continuous

This makes it a good benchmark for comparing classical regression and deep learning regression models.

At this stage:

  • data is still in pandas format
  • no preprocessing has been applied yet

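As an optional check (not part of the core pipeline), we can inspect the shape of the data and the first rows before any preprocessing:

# ====================================
# Optional: quick dataset inspection
# ====================================

# Shapes of the feature matrix and target vector
print(X.shape, y.shape)

# First rows of the features (still a pandas DataFrame)
print(X.head())

# Summary statistics of the continuous target
print(y.describe())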
In the next section, we will split the dataset into training and test sets.


3. Train-test split¶

In this section we split the dataset into training and test sets.

This step allows us to evaluate how well the neural network regressor generalizes to unseen data.

In [4]:
# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Why this step is essential¶

A regression model must be evaluated on data it has never seen during training.

By splitting the data:

  • the training set is used to learn the mapping from features to target values
  • the test set is used only for evaluation

This prevents overly optimistic results and reflects real-world performance.

Choice of split ratio¶

An 80 / 20 split is a common default:

  • enough data to train the model
  • enough data to reliably evaluate predictions

At this point:

  • training and test data are separated
  • no preprocessing has been applied yet

In the next section, we will apply feature scaling.

For deep learning regression, this step is mandatory.


4. Feature scaling (why we do it)¶

In this section we apply feature scaling to the input data.

For deep learning regression models, feature scaling is mandatory.

Neural networks are trained using gradient-based optimization, which is highly sensitive to the scale of input features.

In [5]:
# ====================================
# Feature scaling
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Why we use standardization here¶

We use standardization for feature scaling because neural networks rely on gradients to update their parameters.

Standardization:

  • centers features around zero
  • ensures comparable variance across features
  • improves numerical stability during training

This helps:

  • gradients behave more predictably
  • optimization converge faster
  • training remain stable across layers

At this stage:

  • data is numerically ready
  • still in NumPy format

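As an optional sanity check, we can verify that the scaled training features now have a mean close to 0 and a standard deviation close to 1:

# Optional check: standardized features should be ~0 mean, ~1 std
print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))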
In the next section, we will explain what this model is and how a neural network performs regression using scikit-learn.


5. What is this model? (Deep Learning Regression)¶

Before training the model, it is important to understand what a neural network does when it is used for regression.

Unlike classification, the goal is not to assign a class label, but to predict a continuous numerical value.

What problem are we solving?¶

In a regression problem:

  • each input sample is described by multiple features
  • the target is a real number
  • predictions can take infinitely many values

The model learns a function:

input features → continuous output

The objective is to make predictions as close as possible to the true values.

How does a neural network perform regression?¶

A neural network performs regression by:

  1. Receiving a vector of input features
  2. Combining features through weighted sums
  3. Applying non-linear transformations
  4. Producing a single numerical output

Each layer transforms the input into a representation that is more informative for prediction.

What is different from linear regression?¶

Linear regression assumes:

  • a linear relationship between inputs and target
  • a single global equation

Neural network regression:

  • does not assume linearity
  • learns complex, non-linear relationships
  • adapts to feature interactions automatically

This makes neural networks more expressive than linear models.

Why no activation in the output layer?¶

In regression:

  • the output represents a real value
  • there is no notion of class probability

For this reason:

  • the output layer uses a linear activation
  • the model can predict any real number

Non-linear activations are used only in the hidden layers.

How learning happens conceptually¶

Learning follows an iterative process:

  1. The model makes a numerical prediction
  2. The prediction is compared to the true value
  3. An error is computed
  4. Model parameters are adjusted
  5. The process repeats

Over time, the model reduces the prediction error and improves accuracy.

Why neural networks can overfit in regression¶

Neural networks have high expressive power.

This means:

  • they can fit training data extremely well
  • they may capture noise instead of structure

Overfitting occurs when:

  • training error becomes very small
  • test error stops improving or increases

This behavior is common in deep learning regression models, especially on small datasets.

Key takeaway¶

Deep Learning regression models:

  • predict continuous values
  • learn non-linear feature relationships
  • require careful scaling and evaluation

They are powerful tools when classical regression models are not expressive enough.

In the next section, we will train the neural network regressor using scikit-learn.


6. Model training (Deep Learning Regression)¶

In this section we train a neural network regressor using scikit-learn.

Unlike classical regression models, this model performs real training: it iteratively adjusts internal parameters to minimize prediction error.

In [6]:
# ====================================
# Model initialization
# ====================================

mlp_regressor = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=500,
    random_state=42
)
In [9]:
# ====================================
# Model training
# ====================================

mlp_regressor.fit(X_train_scaled, y_train)
Out[9]:
MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)

What we just did (step by step)¶

1. Defining the model architecture¶

We created a neural network with:

  • two hidden layers
  • 64 neurons in the first layer
  • 32 neurons in the second layer

2. Activation function¶

We use the ReLU activation function in the hidden layers.

ReLU:

  • introduces non-linearity
  • helps the model learn complex patterns
  • improves training stability

The output layer uses a linear activation, which is appropriate for regression.

3. Optimization algorithm¶

The model uses the Adam optimizer.

Adam:

  • adapts learning rates automatically
  • handles noisy gradients well
  • is a strong default choice for neural networks

4. Training iterations¶

The parameter max_iter=500 controls the maximum number of training iterations.

Each iteration:

  • computes predictions
  • measures prediction error
  • updates model parameters

Training stops when:

  • the maximum number of iterations is reached
  • or the optimization converges

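After fitting, the estimator exposes the number of iterations actually run and the training loss per epoch, which helps check whether the optimizer converged before reaching max_iter. A minimal inspection sketch:

# Inspect convergence: epochs actually performed and loss history
print(mlp_regressor.n_iter_)

plt.plot(mlp_regressor.loss_curve_)
plt.xlabel("Iteration (epoch)")
plt.ylabel("Training loss")
plt.title("MLPRegressor training loss curve")
plt.show()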
5. What scikit-learn handles internally¶

scikit-learn automatically performs:

  • forward passes
  • loss computation (mean squared error)
  • gradient calculation
  • parameter updates

This makes the training process:

  • concise
  • robust
  • easy to use

At the cost of:

  • reduced control
  • limited customization

Key takeaway¶

scikit-learn allows us to train a deep learning regression model with very little code.

The core learning principles are identical to PyTorch and TensorFlow, but the implementation is fully abstracted.

In the next section, we will analyze model behavior and the most important parameters that influence regression performance.


7. Model behavior and key parameters¶

In this section we analyze how the deep learning regressor behaves during training and which parameters most strongly influence its performance.

Unlike linear regression, the behavior of a neural network emerges from the interaction of architecture, optimization, and data.

Model capacity and architecture¶

The architecture determines how expressive the model is.

In this notebook, the network has:

  • two hidden layers
  • 64 neurons in the first layer
  • 32 neurons in the second layer

This gives the model enough capacity to learn complex, non-linear relationships.

However:

  • higher capacity increases the risk of overfitting
  • smaller datasets are more sensitive to model size

Architecture choices in real-world scenarios¶

In real-world problems, the architecture of a neural network is often adjusted based on:

  • dataset size
  • feature complexity
  • problem difficulty

In such cases, it is common to increase model depth by adding more hidden layers.

Example of a deeper architecture (conceptual)¶

For example, a deeper regression network could be defined with additional hidden layers, such as:

MLPRegressor(
    hidden_layer_sizes=(128, 64, 32, 16),
    activation="relu",
    solver="adam"
)

Role of hidden layers in regression¶

Hidden layers allow the model to:

  • combine input features in non-linear ways
  • capture interactions between variables
  • approximate complex regression functions

Each layer transforms the input into a representation that makes the target value easier to predict.

Deeper representations do not guarantee better performance, but they increase expressive power.

Optimization behavior¶

The model is trained using gradient-based optimization.

Key aspects of this process:

  • the model minimizes prediction error
  • parameters are updated iteratively
  • learning happens through many small adjustments

The Adam optimizer:

  • adapts learning rates automatically
  • speeds up convergence
  • works well across many regression problems

Effect of training iterations¶

The number of training iterations controls how long the model learns.

  • too few iterations → underfitting
  • too many iterations → overfitting

Because neural networks are flexible, they can fit training data very closely if allowed to train for too long.
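One common safeguard against training for too long is early stopping. The following sketch shows an alternative configuration (illustrative, not the model trained above) in which scikit-learn holds out part of the training data and stops when the validation score stops improving:

# Alternative configuration (illustrative): stop when validation score plateaus
mlp_early_stop = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    early_stopping=True,        # hold out part of the training data
    validation_fraction=0.1,    # 10% of training data used for validation
    n_iter_no_change=10,        # stop after 10 epochs without improvement
    max_iter=500,
    random_state=42
)
# mlp_early_stop.fit(X_train_scaled, y_train)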

Overfitting in regression¶

Overfitting in regression occurs when:

  • training error becomes very small
  • test error stops improving or increases

The model may start fitting:

  • noise
  • outliers
  • dataset-specific patterns

This behavior is common when deep models are applied to limited datasets.

Sensitivity to feature scaling¶

Neural network regressors are highly sensitive to feature scale.

Without proper scaling:

  • gradients become unstable
  • convergence slows down
  • training may fail completely

Standardization is therefore an essential part of the pipeline, not an optional preprocessing step.

Comparison with classical regression models¶

Compared to linear regression:

  • neural networks are more expressive
  • they capture non-linear relationships
  • they require more careful tuning

The performance gain comes at the cost of:

  • reduced interpretability
  • higher computational complexity

Key takeaway¶

The behavior of a deep learning regressor is determined by:

  • model architecture
  • training duration
  • optimization strategy
  • data preprocessing

Understanding these factors helps explain why the model performs well or why it may fail to generalize.

In the next section, we will use the trained model to generate predictions and evaluate regression performance.


8. Predictions¶

In this section we use the trained neural network to generate predictions on unseen test data.

For regression models, predictions are continuous numerical values, not class labels.

In [10]:
# ====================================
# Predictions
# ====================================

y_pred = mlp_regressor.predict(X_test_scaled)

What the model outputs¶

The neural network outputs:

  • one numerical value per input sample
  • representing the predicted target value

Each prediction corresponds to:

  • a continuous estimate
  • not constrained to a fixed range
  • directly comparable to the true target

Interpretation of predictions¶

In regression:

  • the goal is not exact equality
  • small differences are expected
  • performance is evaluated using error metrics

Predictions should be analyzed:

  • numerically (error metrics)
  • visually (optional plots)
  • comparatively (against baselines)

At this point, we have:

  • true target values (y_test)
  • predicted target values (y_pred)

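As an optional visual check, a scatter plot of predicted against true values gives a quick sense of how closely predictions track the target; points near the diagonal indicate accurate predictions.

# Optional: predicted vs. true values
plt.scatter(y_test, y_pred, alpha=0.3)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         color="red")                      # perfect-prediction reference line
plt.xlabel("True target value")
plt.ylabel("Predicted target value")
plt.title("Predicted vs. true values")
plt.show()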
In the next section, we will evaluate regression performance using standard regression metrics.


9. Model evaluation¶

In this section we evaluate the performance of the deep learning regression model on unseen test data.

For regression problems, evaluation focuses on prediction error rather than classification accuracy.

In [12]:
# ====================================
# Regression evaluation metrics
# ====================================

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

mse, mae, r2, rmse 
Out[12]:
(0.2742889195569193,
 0.35050323257082144,
 0.7906844930770192,
 np.float64(0.5237259966403418))

How to read these results together¶

No single metric is sufficient to fully evaluate a regression model.

Each metric answers a different and complementary question.

Mean Squared Error (MSE)¶

The Mean Squared Error measures the average squared difference between predicted and true values.

  • penalizes large errors strongly
  • sensitive to outliers
  • lower values indicate better performance

Mean Absolute Error (MAE)¶

The Mean Absolute Error measures the average absolute difference between predictions and true values.

  • easier to interpret than MSE
  • less sensitive to outliers
  • expressed in the same units as the target

R² score (coefficient of determination)¶

The R² score measures how much variance in the target variable is explained by the model.

  • R² = 1 → perfect prediction
  • R² = 0 → model performs like a constant predictor
  • R² < 0 → model performs worse than a baseline

Higher values indicate better explanatory power.
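To make the R² interpretation concrete, a constant (mean) predictor can serve as a baseline. This comparison is a minimal illustrative sketch using scikit-learn's DummyRegressor, not part of the main pipeline:

from sklearn.dummy import DummyRegressor

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_scaled, y_train)

# R² of a constant predictor is close to 0 on the test set
print(r2_score(y_test, baseline.predict(X_test_scaled)))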

Root Mean Squared Error (RMSE)¶

In many real-world regression problems, RMSE is one of the most important metrics.

RMSE:

  • is expressed in the same unit as the target
  • provides a direct measure of typical prediction error
  • penalizes large errors more strongly than MAE
  • is easier to interpret than MSE

For this reason, RMSE is often preferred when communicating model performance to non-technical stakeholders.

Role of each metric¶

  • RMSE
    Measures the typical prediction error
    in the same unit as the target variable.
    It is often the most intuitive metric for practical interpretation.

  • MAE
    Measures the average absolute error.
    It is less sensitive to outliers and provides a robust view of performance.

  • R² score
    Measures how much of the target variance is explained by the model.
    It reflects overall fit quality, not absolute error magnitude.

Interpreting metrics together¶

A good regression model typically shows:

  • low RMSE
  • low MAE
  • high R²

However:

  • a high R² does not guarantee low error
  • a low error does not guarantee strong generalization

Metrics must always be interpreted together and on unseen test data.

Key takeaway¶

Deep learning regression models must be evaluated using multiple metrics.

RMSE and MAE describe how much the model errs, while R² describes how well the model explains the data.

Using these metrics together provides a complete and reliable assessment of regression performance.

In the next section, we will discuss when deep learning regression is an appropriate choice and when simpler models may be preferable.


10. When to use it and when not to¶

Deep Learning regression models are powerful, but they are not always the best choice.

Choosing this approach depends on:

  • data complexity
  • dataset size
  • performance requirements
  • interpretability constraints

When to use Deep Learning for regression¶

Deep learning regression is a good choice when:

  • the relationship between features and target is non-linear
  • feature interactions are complex
  • classical linear models underperform
  • prediction accuracy is more important than interpretability
  • sufficient training data is available

It is particularly useful for:

  • complex tabular data
  • problems with hidden patterns
  • scenarios where flexibility is required

When NOT to use Deep Learning for regression¶

Deep learning regression may not be ideal when:

  • the dataset is small
  • the relationship is approximately linear
  • model interpretability is critical
  • training time or resources are limited

In these cases, simpler regression models are often more efficient and reliable.

Practical warning signs¶

You should be cautious if:

  • training error is very low
  • test error does not improve
  • model complexity grows quickly
  • performance gains are marginal

These are common indicators that deep learning may be unnecessary for the problem at hand.

Comparison with classical regression¶

Compared to linear regression:

  • deep learning models are more expressive
  • they can capture non-linear relationships
  • they require more tuning and care

The performance gain comes at the cost of:

  • reduced interpretability
  • increased computational complexity

Key takeaway¶

Deep Learning regression models are powerful tools for complex problems, but they should not be used by default.

Model choice should always balance:

  • accuracy
  • complexity
  • interpretability
  • maintainability

In the next section, we will save the trained model and complete the regression pipeline.


11. Model persistence¶

In this section we save the trained deep learning regression model and the preprocessing steps used during training.

Saving the model allows us to:

  • reuse it without retraining
  • ensure reproducibility
  • separate training from inference
In [ ]:
# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)

# Save trained model
joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")

# Save scaler
joblib.dump(scaler, model_dir / "scaler.joblib")

What we have saved¶

We saved:

  • the trained neural network regressor
  • the feature scaler used during preprocessing

These components together form the complete regression pipeline.

Why saving the scaler is essential¶

Neural networks are highly sensitive to feature scaling.

Using a different scaler would lead to inconsistent predictions.

Saving the scaler ensures that new data is transformed in exactly the same way as during training.

How the model can be reused¶

To reuse the model:

  1. load the scaler
  2. apply it to new input data
  3. load the trained regressor
  4. generate predictions

This guarantees consistency between training and inference.
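The following sketch shows how the saved artifacts could be loaded later for inference, assuming the same directory layout as above:

# Reload the persisted pipeline components
loaded_scaler = joblib.load(model_dir / "scaler.joblib")
loaded_model = joblib.load(model_dir / "mlp_regressor.joblib")

# Apply the SAME scaling used at training time, then predict
X_new_scaled = loaded_scaler.transform(X_test)   # X_test stands in for new data
new_predictions = loaded_model.predict(X_new_scaled)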


12. Mathematical formulation (deep dive)¶

This section provides a conceptual and mathematical view of deep learning regression.

The goal is to understand what is optimized and how predictions are produced, without going into low-level implementation details.

Representation of the data¶

The regression dataset is represented as:

$$ \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} $$

where:

  • $x_i \in \mathbb{R}^d$ is a feature vector
  • $y_i \in \mathbb{R}$ is a continuous target value

Neural network as a function approximator¶

A neural network learns a function:

$$ \hat{y} = f(x; \theta) $$

where:

  • $x$ is the input feature vector
  • $\theta$ represents all weights and biases
  • $\hat{y}$ is the predicted value

The function $f$ is non-linear and composed of multiple layers.

Layer-wise transformation¶

Each hidden layer applies a transformation of the form:

$$ h = \sigma(Wx + b) $$

where:

  • $W$ is the weight matrix
  • $b$ is the bias vector
  • $\sigma$ is a non-linear activation function (ReLU)

This process is repeated across layers, building increasingly abstract representations.

Output layer for regression¶

For regression, the output layer is linear:

$$ \hat{y} = W_{\text{out}} h + b_{\text{out}} $$

This allows the model to predict any real-valued number.
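To make this concrete, the following NumPy sketch (purely illustrative, with random weights rather than the trained model) implements a two-hidden-layer forward pass with ReLU activations and a linear output:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a d -> 64 -> 32 -> 1 network
d = 8
W1, b1 = rng.normal(size=(64, d)), np.zeros(64)
W2, b2 = rng.normal(size=(32, 64)), np.zeros(32)
W_out, b_out = rng.normal(size=(1, 32)), np.zeros(1)

def relu(z):
    return np.maximum(0, z)

def forward(x):
    h1 = relu(W1 @ x + b1)          # first hidden layer
    h2 = relu(W2 @ h1 + b2)         # second hidden layer
    return W_out @ h2 + b_out       # linear output: any real number

x = rng.normal(size=d)              # one (random) feature vector
print(forward(x))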

Loss function¶

The model is trained by minimizing the Mean Squared Error (MSE):

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

This loss penalizes large errors and encourages accurate predictions.

Optimization process¶

Training consists of:

  • computing predictions
  • measuring error via the loss function
  • updating parameters using gradients

An optimizer (e.g. Adam) adjusts the parameters iteratively to minimize the loss.

Bias–variance perspective¶

Deep learning regression models:

  • have low bias (high flexibility)
  • may have high variance if over-parameterized

Generalization depends on:

  • model capacity
  • data size
  • training duration
  • regularization effects

Final takeaway¶

Deep learning regression can be viewed as:

  • learning a non-linear function
  • minimizing prediction error
  • approximating complex relationships

The mathematical principles are simple, but their composition yields powerful models.


13. Final summary – Code only¶

The following cell contains the complete regression pipeline from data loading to model persistence.

No explanations are provided here on purpose.

This section is intended for:

  • quick execution
  • reference
  • reuse in scripts or applications
In [ ]:
# ====================================
# Imports
# ====================================

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score
)

from pathlib import Path
import joblib
import matplotlib.pyplot as plt


# ====================================
# Dataset loading
# ====================================

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target


# ====================================
# Train-test split
# ====================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


# ====================================
# Feature scaling
# ====================================

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# ====================================
# Model initialization
# ====================================

mlp_regressor = MLPRegressor(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    max_iter=500,
    random_state=42
)


# ====================================
# Model training
# ====================================

mlp_regressor.fit(X_train_scaled, y_train)


# ====================================
# Predictions
# ====================================

y_pred = mlp_regressor.predict(X_test_scaled)


# ====================================
# Model evaluation
# ====================================

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, rmse, mae, r2


# ====================================
# Model persistence
# ====================================

model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")