Deep Learning – Regression (scikit-learn)¶
This notebook is part of the ML-Methods project.
It introduces Deep Learning for supervised regression using the scikit-learn implementation of neural networks.
As with all other notebooks in this project, the initial sections focus on data preparation and are intentionally repeated.
This ensures:
- conceptual consistency
- fair comparison across models
- a unified learning pipeline
Notebook Roadmap (standard ML-Methods)¶
- Project setup and common pipeline
- Dataset loading
- Train-test split
- Feature scaling (why we do it)
- What is this model? (Intuition)
- Model training
- Model behavior and key parameters
- Predictions
- Model evaluation
- When to use it and when not to
- Model persistence
- Mathematical formulation (deep dive)
- Final summary – Code only
How this notebook should be read¶
This notebook is designed to be read top to bottom.
Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process
The goal is not just to run the code, but to understand how deep learning regression differs from classical regression models and how it fits into the supervised learning pipeline.
What is Deep Learning (in this context)?¶
In this notebook, Deep Learning refers to neural networks with multiple layers used to solve regression problems.
Unlike classification:
- the target is a continuous value
- there are no class labels
- the model predicts a real number
The neural network learns a function:
input features → continuous output
What do we want to achieve?¶
Our objective is to train a model that:
- receives a vector of numerical features
- processes them through multiple layers
- outputs a single continuous value
The model learns how combinations of input features map to a numerical target.
This is useful when:
- relationships are non-linear
- classical linear regression is insufficient
- feature interactions are complex
Why use scikit-learn for Deep Learning regression?¶
scikit-learn provides a high-level abstraction for neural networks through MLPRegressor.
This allows us to:
- reuse the same ML pipeline as classical models
- focus on concepts rather than low-level training details
- understand what deep learning regression does before implementing it manually
This notebook acts as a bridge between classical regression and full deep learning frameworks such as PyTorch and TensorFlow.
What you should expect from the results¶
With Deep Learning (scikit-learn regression), you should expect:
- ability to model non-linear relationships
- improved performance on complex patterns
- sensitivity to feature scaling
- longer training times than linear models
However:
- interpretability is low
- hyperparameter tuning is important
- the model behaves as a black box
1. Project setup and common pipeline¶
In this section we set up the common pipeline used across regression models in this project.
Although this notebook uses a neural network, the data preparation steps remain identical to other regression approaches.
# ====================================
# Common imports used across regression models
# ====================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score
)
from pathlib import Path
import joblib
import matplotlib.pyplot as plt
What changes compared to classical regression¶
Compared to linear regression:
- the target remains continuous
- the pipeline remains identical
- the model becomes non-linear
The main difference lies in the model itself, not in the surrounding workflow.
In the next section, we will load the dataset used for the regression task.
2. Dataset loading¶
In this section we load the dataset used for the deep learning regression task.
We use a regression-specific dataset with a continuous target variable, which allows us to evaluate how neural networks behave when predicting real-valued outputs.
# ====================================
# Dataset loading
# ====================================
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
Inputs and target¶
- X contains the input features
- y contains the target variable
This is a supervised regression problem:
- each sample has a continuous target value
- the goal is to predict a real number, not a class
Why this dataset¶
The California Housing dataset is well suited for regression because:
- it contains numerical features
- relationships are non-linear
- target values are continuous
This makes it a good benchmark for comparing classical regression and deep learning regression models.
At this stage:
- data is still in pandas format
- no preprocessing has been applied yet
In the next section, we will split the dataset into training and test sets.
3. Train-test split¶
In this section we split the dataset into training and test sets.
This step allows us to evaluate how well the neural network regressor generalizes to unseen data.
# ====================================
# Train-test split
# ====================================
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
Why this step is essential¶
A regression model must be evaluated on data it has never seen during training.
By splitting the data:
- the training set is used to learn the mapping from features to target values
- the test set is used only for evaluation
This prevents overly optimistic results and reflects real-world performance.
Choice of split ratio¶
An 80 / 20 split is a common default:
- enough data to train the model
- enough data to reliably evaluate predictions
At this point:
- training and test data are separated
- no preprocessing has been applied yet
In the next section, we will apply feature scaling.
For deep learning regression, this step is mandatory.
4. Feature scaling (why we do it)¶
In this section we apply feature scaling to the input data.
For deep learning regression models, feature scaling is mandatory.
Neural networks are trained using gradient-based optimization, which is highly sensitive to the scale of input features.
# ====================================
# Feature scaling
# ====================================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Why we use standardization here¶
We use standardization for feature scaling because neural networks rely on gradients to update their parameters.
Standardization:
- centers features around zero
- ensures comparable variance across features
- improves numerical stability during training
This helps:
- gradients behave more predictably
- optimization converge faster
- training remain stable across layers
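The effect of standardization can be verified directly. The sketch below uses small made-up values with very different feature scales and checks that, after fitting, every column has mean approximately 0 and standard deviation approximately 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with very different feature scales (illustrative values only)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After standardization, every column has mean ~0 and std ~1
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```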
At this stage:
- data is numerically ready
- still in NumPy format
In the next section, we will explain what this model is and how a neural network performs regression using scikit-learn.
5. What is this model? (Deep Learning Regression)¶
Before training the model, it is important to understand what a neural network does when it is used for regression.
Unlike classification, the goal is not to assign a class label, but to predict a continuous numerical value.
What problem are we solving?¶
In a regression problem:
- each input sample is described by multiple features
- the target is a real number
- predictions can take infinitely many values
The model learns a function:
input features → continuous output
The objective is to make predictions as close as possible to the true values.
How does a neural network perform regression?¶
A neural network performs regression by:
- Receiving a vector of input features
- Combining features through weighted sums
- Applying non-linear transformations
- Producing a single numerical output
Each layer transforms the input into a representation that is more informative for prediction.
What is different from linear regression?¶
Linear regression assumes:
- a linear relationship between inputs and target
- a single global equation
Neural network regression:
- does not assume linearity
- learns complex, non-linear relationships
- adapts to feature interactions automatically
This makes neural networks more expressive than linear models.
Why no activation in the output layer?¶
In regression:
- the output represents a real value
- there is no notion of class probability
For this reason:
- the output layer uses a linear activation
- the model can predict any real number
Non-linear activations are used only in the hidden layers.
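This can be confirmed in scikit-learn: after fitting, MLPRegressor exposes the output activation through its out_activation_ attribute, which is "identity" for regression. A minimal sketch on synthetic data (the dataset here is made up for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Tiny synthetic regression problem (illustrative only)
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X.sum(axis=1)

mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=300, random_state=42)
mlp.fit(X, y)

# Hidden layers use a non-linear activation; the output layer is linear
print(mlp.activation)       # 'relu' (the default)
print(mlp.out_activation_)  # 'identity'
```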
How learning happens conceptually¶
Learning follows an iterative process:
- The model makes a numerical prediction
- The prediction is compared to the true value
- An error is computed
- Model parameters are adjusted
- The process repeats
Over time, the model reduces the prediction error and improves accuracy.
Why neural networks can overfit in regression¶
Neural networks have high expressive power.
This means:
- they can fit training data extremely well
- they may capture noise instead of structure
Overfitting occurs when:
- training error becomes very small
- test error stops improving or increases
This behavior is common in deep learning regression models, especially on small datasets.
Key takeaway¶
Deep Learning regression models:
- predict continuous values
- learn non-linear feature relationships
- require careful scaling and evaluation
They are powerful tools when classical regression models are not expressive enough.
In the next section, we will train the neural network regressor using scikit-learn.
6. Model training (Deep Learning Regression)¶
In this section we train a neural network regressor using scikit-learn.
Unlike classical regression models, this model performs real training: it iteratively adjusts internal parameters to minimize prediction error.
# ====================================
# Model initialization
# ====================================
mlp_regressor = MLPRegressor(
hidden_layer_sizes=(64, 32),
activation="relu",
solver="adam",
learning_rate_init=0.001,
max_iter=500,
random_state=42
)
# ====================================
# Model training
# ====================================
mlp_regressor.fit(X_train_scaled, y_train)
MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
Parameters¶
1. Network architecture¶
The parameter hidden_layer_sizes=(64, 32) defines two hidden layers with 64 and 32 neurons.
This gives the model enough capacity to learn non-linear relationships without being excessively large.
2. Activation function¶
We use the ReLU activation function in the hidden layers.
ReLU:
- introduces non-linearity
- helps the model learn complex patterns
- improves training stability
The output layer uses a linear activation, which is appropriate for regression.
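ReLU is simply max(0, x): it passes positive values through unchanged and maps negative values to zero. A one-line NumPy illustration:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
relu = np.maximum(0, z)  # ReLU zeroes out negatives, keeps positives
print(relu)              # zeros for the negatives; 1.5 and 3.0 preserved
```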
3. Optimization algorithm¶
The model uses the Adam optimizer.
Adam:
- adapts learning rates automatically
- handles noisy gradients well
- is a strong default choice for neural networks
4. Training iterations¶
The parameter max_iter=500 controls the maximum number of training iterations.
Each iteration:
- computes predictions
- measures prediction error
- updates model parameters
Training stops when:
- the maximum number of iterations is reached
- or the optimization converges
5. What scikit-learn handles internally¶
scikit-learn automatically performs:
- forward passes
- loss computation (mean squared error)
- gradient calculation
- parameter updates
This makes the training process:
- concise
- robust
- easy to use
At the cost of:
- reduced control
- limited customization
Key takeaway¶
scikit-learn allows us to train a deep learning regression model with very little code.
The core learning principles are identical to PyTorch and TensorFlow, but the implementation is fully abstracted.
In the next section, we will analyze model behavior and the most important parameters that influence regression performance.
7. Model behavior and key parameters¶
In this section we analyze how the deep learning regressor behaves during training and which parameters most strongly influence its performance.
Unlike linear regression, the behavior of a neural network emerges from the interaction of architecture, optimization, and data.
Model capacity and architecture¶
The architecture determines how expressive the model is.
In this notebook, the network has:
- two hidden layers
- 64 neurons in the first layer
- 32 neurons in the second layer
This gives the model enough capacity to learn complex, non-linear relationships.
However:
- higher capacity increases the risk of overfitting
- smaller datasets are more sensitive to model size
Architecture choices in real-world scenarios¶
In real-world problems, the architecture of a neural network is often adjusted based on:
- dataset size
- feature complexity
- problem difficulty
In such cases, it is common to increase model depth by adding more hidden layers.
Example of a deeper architecture (conceptual)¶
For example, a deeper regression network could be defined with additional hidden layers, such as:
MLPRegressor(
hidden_layer_sizes=(128, 64, 32, 16),
activation="relu",
solver="adam"
)
Role of hidden layers in regression¶
Hidden layers allow the model to:
- combine input features in non-linear ways
- capture interactions between variables
- approximate complex regression functions
Each layer transforms the input into a representation that makes the target value easier to predict.
Deeper representations do not guarantee better performance, but they increase expressive power.
Optimization behavior¶
The model is trained using gradient-based optimization.
Key aspects of this process:
- the model minimizes prediction error
- parameters are updated iteratively
- learning happens through many small adjustments
The Adam optimizer:
- adapts learning rates automatically
- speeds up convergence
- works well across many regression problems
Effect of training iterations¶
The number of training iterations controls how long the model learns.
- too few iterations → underfitting
- too many iterations → overfitting
Because neural networks are flexible, they can fit training data very closely if allowed to train for too long.
Overfitting in regression¶
Overfitting in regression occurs when:
- training error becomes very small
- test error stops improving or increases
The model may start fitting:
- noise
- outliers
- dataset-specific patterns
This behavior is common when deep models are applied to limited datasets.
Sensitivity to feature scaling¶
Neural network regressors are highly sensitive to feature scale.
Without proper scaling:
- gradients become unstable
- convergence slows down
- training may fail completely
Standardization is therefore an essential part of the pipeline, not an optional preprocessing step.
Comparison with classical regression models¶
Compared to linear regression:
- neural networks are more expressive
- they capture non-linear relationships
- they require more careful tuning
The performance gain comes at the cost of:
- reduced interpretability
- higher computational complexity
Key takeaway¶
The behavior of a deep learning regressor is determined by:
- model architecture
- training duration
- optimization strategy
- data preprocessing
Understanding these factors helps explain why the model performs well or why it may fail to generalize.
In the next section, we will use the trained model to generate predictions and evaluate regression performance.
8. Predictions¶
In this section we use the trained neural network to generate predictions on unseen test data.
For regression models, predictions are continuous numerical values, not class labels.
# ====================================
# Predictions
# ====================================
y_pred = mlp_regressor.predict(X_test_scaled)
What the model outputs¶
The neural network outputs:
- one numerical value per input sample
- representing the predicted target value
Each prediction corresponds to:
- a continuous estimate
- not constrained to a fixed range
- directly comparable to the true target
Interpretation of predictions¶
In regression:
- the goal is not exact equality
- small differences are expected
- performance is evaluated using error metrics
Predictions should be analyzed:
- numerically (error metrics)
- visually (optional plots)
- comparatively (against baselines)
At this point, we have:
- true target values (y_test)
- predicted target values (y_pred)
In the next section, we will evaluate regression performance using standard regression metrics.
9. Model evaluation¶
In this section we evaluate the performance of the deep learning regression model on unseen test data.
For regression problems, evaluation focuses on prediction error rather than classification accuracy.
# ====================================
# Regression evaluation metrics
# ====================================
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)
mse, mae, r2, rmse
(0.2742889195569193, 0.35050323257082144, 0.7906844930770192, np.float64(0.5237259966403418))
How to read these results together¶
No single metric is sufficient to fully evaluate a regression model.
Each metric answers a different and complementary question.
Mean Squared Error (MSE)¶
The Mean Squared Error measures the average squared difference between predicted and true values.
- penalizes large errors strongly
- sensitive to outliers
- lower values indicate better performance
Mean Absolute Error (MAE)¶
The Mean Absolute Error measures the average absolute difference between predictions and true values.
- easier to interpret than MSE
- less sensitive to outliers
- expressed in the same units as the target
R² score (coefficient of determination)¶
The R² score measures how much variance in the target variable is explained by the model.
- R² = 1 → perfect prediction
- R² = 0 → model performs like a constant predictor
- R² < 0 → model performs worse than a baseline
Higher values indicate better explanatory power.
Root Mean Squared Error (RMSE)¶
In many regression problems, RMSE is one of the most important metrics.
RMSE:
- is expressed in the same unit as the target
- provides a direct measure of prediction error
- is easier to interpret than MSE
For this reason, RMSE is often preferred when communicating model performance, especially to non-technical stakeholders.
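All four metrics can be computed by hand in a few lines of NumPy, which makes their definitions concrete. The values below are made up purely for illustration:

```python
import numpy as np

# Illustrative true and predicted values
y_true = np.array([3.0, 2.5, 4.0, 5.0])
y_hat = np.array([2.8, 2.7, 4.4, 4.6])

mse = np.mean((y_true - y_hat) ** 2)    # average squared error
mae = np.mean(np.abs(y_true - y_hat))   # average absolute error
rmse = np.sqrt(mse)                     # back in the target's units

# R²: 1 minus residual variance over total variance
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(mse, 3), round(mae, 3), round(rmse, 3), round(r2, 3))
# 0.1 0.3 0.316 0.892
```

These match sklearn.metrics.mean_squared_error, mean_absolute_error, and r2_score on the same inputs.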
Role of each metric¶
RMSE measures the typical prediction error in the same unit as the target variable. It is often the most intuitive metric for practical interpretation.
MAE measures the average absolute error. It is less sensitive to outliers and provides a robust view of performance.
R² score measures how much of the target variance is explained by the model. It reflects overall fit quality, not absolute error magnitude.
Interpreting metrics together¶
A good regression model typically shows:
- low RMSE
- low MAE
- high R²
However:
- a high R² does not guarantee low error
- a low error does not guarantee strong generalization
Metrics must always be interpreted together and on unseen test data.
Key takeaway¶
Deep learning regression models must be evaluated using multiple metrics.
RMSE and MAE describe how much the model errs, while R² describes how well the model explains the data.
Using these metrics together provides a complete and reliable assessment of regression performance.
In the next section, we will discuss when deep learning regression is an appropriate choice and when simpler models may be preferable.
10. When to use it and when not to¶
Deep Learning regression models are powerful, but they are not always the best choice.
Choosing this approach depends on:
- data complexity
- dataset size
- performance requirements
- interpretability constraints
When to use Deep Learning for regression¶
Deep learning regression is a good choice when:
- the relationship between features and target is non-linear
- feature interactions are complex
- classical linear models underperform
- prediction accuracy is more important than interpretability
- sufficient training data is available
It is particularly useful for:
- complex tabular data
- problems with hidden patterns
- scenarios where flexibility is required
When NOT to use Deep Learning for regression¶
Deep learning regression may not be ideal when:
- the dataset is small
- the relationship is approximately linear
- model interpretability is critical
- training time or resources are limited
In these cases, simpler regression models are often more efficient and reliable.
Practical warning signs¶
You should be cautious if:
- training error is very low
- test error does not improve
- model complexity grows quickly
- performance gains are marginal
These are common indicators that deep learning may be unnecessary for the problem at hand.
Comparison with classical regression¶
Compared to linear regression:
- deep learning models are more expressive
- they can capture non-linear relationships
- they require more tuning and care
The performance gain comes at the cost of:
- reduced interpretability
- increased computational complexity
Key takeaway¶
Deep Learning regression models are powerful tools for complex problems, but they should not be used by default.
Model choice should always balance:
- accuracy
- complexity
- interpretability
- maintainability
In the next section, we will save the trained model and complete the regression pipeline.
11. Model persistence¶
In this section we save the trained deep learning regression model and the preprocessing steps used during training.
Saving the model allows us to:
- reuse it without retraining
- ensure reproducibility
- separate training from inference
# ====================================
# Model persistence
# ====================================
model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)
# Save trained model
joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")
# Save scaler
joblib.dump(scaler, model_dir / "scaler.joblib")
What we have saved¶
We saved:
- the trained neural network regressor
- the feature scaler used during preprocessing
These components together form the complete regression pipeline.
Why saving the scaler is essential¶
Neural networks are highly sensitive to feature scaling.
Using a different scaler would lead to inconsistent predictions.
Saving the scaler ensures that new data is transformed in exactly the same way as during training.
How the model can be reused¶
To reuse the model:
- load the scaler
- apply it to new input data
- load the trained regressor
- generate predictions
This guarantees consistency between training and inference.
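The steps above can be sketched as a round trip: train, save, reload, and verify that the reloaded pipeline reproduces the original predictions. The sketch below trains a tiny stand-in model on synthetic data and saves into a temporary directory so it is self-contained; in the notebook you would load from model_dir instead:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = X.sum(axis=1)

scaler = StandardScaler().fit(X)
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
mlp.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    joblib.dump(mlp, model_dir / "mlp_regressor.joblib")
    joblib.dump(scaler, model_dir / "scaler.joblib")

    # Inference: load the scaler first, transform new data, then predict
    loaded_scaler = joblib.load(model_dir / "scaler.joblib")
    loaded_model = joblib.load(model_dir / "mlp_regressor.joblib")
    X_new = rng.random((5, 4))
    preds = loaded_model.predict(loaded_scaler.transform(X_new))

# The reloaded pipeline reproduces the original model's predictions
print(np.allclose(preds, mlp.predict(scaler.transform(X_new))))  # True
```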
12. Mathematical formulation (deep dive)¶
This section provides a conceptual and mathematical view of deep learning regression.
The goal is to understand what is optimized and how predictions are produced, without going into low-level implementation details.
Representation of the data¶
The regression dataset is represented as:
$$ \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} $$
where:
- $x_i \in \mathbb{R}^d$ is a feature vector
- $y_i \in \mathbb{R}$ is a continuous target value
Neural network as a function approximator¶
A neural network learns a function:
$$ \hat{y} = f(x; \theta) $$
where:
- $x$ is the input feature vector
- $\theta$ represents all weights and biases
- $\hat{y}$ is the predicted value
The function $f$ is non-linear and composed of multiple layers.
Layer-wise transformation¶
Each hidden layer applies a transformation of the form:
$$ h = \sigma(Wx + b) $$
where:
- $W$ is the weight matrix
- $b$ is the bias vector
- $\sigma$ is a non-linear activation function (ReLU)
This process is repeated across layers, building increasingly abstract representations.
Output layer for regression¶
For regression, the output layer is linear:
$$ \hat{y} = W_{\text{out}} h + b_{\text{out}} $$
This allows the model to predict any real-valued number.
Loss function¶
The model is trained by minimizing the Mean Squared Error (MSE):
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
This loss penalizes large errors and encourages accurate predictions.
Optimization process¶
Training consists of:
- computing predictions
- measuring error via the loss function
- updating parameters using gradients
An optimizer (e.g. Adam) adjusts the parameters iteratively to minimize the loss.
Bias–variance perspective¶
Deep learning regression models:
- have low bias (high flexibility)
- may have high variance if over-parameterized
Generalization depends on:
- model capacity
- data size
- training duration
- regularization effects
Final takeaway¶
Deep learning regression can be viewed as:
- learning a non-linear function
- minimizing prediction error
- approximating complex relationships
The mathematical principles are simple, but their composition yields powerful models.
13. Final summary – Code only¶
The following cell contains the complete regression pipeline from data loading to model persistence.
No explanations are provided here on purpose.
This section is intended for:
- quick execution
- reference
- reuse in scripts or applications
# ====================================
# Imports
# ====================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score
)
from pathlib import Path
import joblib
import matplotlib.pyplot as plt
# ====================================
# Dataset loading
# ====================================
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
# ====================================
# Train-test split
# ====================================
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
# ====================================
# Feature scaling
# ====================================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ====================================
# Model initialization
# ====================================
mlp_regressor = MLPRegressor(
hidden_layer_sizes=(64, 32),
activation="relu",
solver="adam",
learning_rate_init=0.001,
max_iter=500,
random_state=42
)
# ====================================
# Model training
# ====================================
mlp_regressor.fit(X_train_scaled, y_train)
# ====================================
# Predictions
# ====================================
y_pred = mlp_regressor.predict(X_test_scaled)
# ====================================
# Model evaluation
# ====================================
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, rmse, mae, r2
# ====================================
# Model persistence
# ====================================
model_dir = Path("models/supervised_learning/regression/deep_learning_sklearn")
model_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(mlp_regressor, model_dir / "mlp_regressor.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")