Supervised Learning -> KNN Regression (K-Nearest Neighbors)¶
The first sections (1-4) are identical across all regression models: keeping this part unchanged allows easy model comparison and prevents pipeline-related mistakes. Everything after that depends on the specific model.
- Project setup and common pipeline
- Dataset loading
- Train-test split
- Feature scaling (why we do it)
- What is this model? (Intuition)
- Model training
- Model behavior and key hyperparameters
- Predictions
- Model evaluation
- When to use it and when not to
- Model persistence
- Mathematical formulation (deep dive)
- Final summary – Code only
How this notebook should be read¶
This notebook is designed to be read top to bottom.
Before every code cell, you will find a short explanation describing:
- what we are about to do
- why this step is necessary
- how it fits into the overall process
The goal is not just to run the code, but to understand what is happening at each step and be able to adapt it to your own data.
What is KNN Regression?¶
KNN Regression is a very different model compared to Linear Regression.
Instead of learning a global formula or a line, KNN Regression makes predictions by looking at the data directly.
The idea is simple: to predict the value for a new data point, the model searches for the K closest data points in the training set.
Once these neighbors are found:
- their target values are collected
- the prediction is computed as the average of those values
There is no explicit training phase where parameters are learned. The model simply stores the training data and uses it at prediction time.
For this reason, KNN is often described as a lazy, instance-based model.
Why we start with intuition¶
Starting with intuition is especially important for KNN Regression.
All the model does is:
- measure distances between data points
- decide which points are close
- aggregate their target values
If this idea is clear, the rest of the notebook becomes easy to follow.
Every step in the code will reflect this logic: distance → neighbors → average → prediction.
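This distance → neighbors → average logic can be sketched in a few lines of plain NumPy (the tiny one-feature dataset below is invented purely for illustration):

```python
import numpy as np

# Tiny illustrative dataset: one feature, five training points.
X_train_demo = np.array([[1.0], [2.0], [3.0], [8.0], [9.0]])
y_train_demo = np.array([10.0, 12.0, 11.0, 40.0, 42.0])

def knn_predict(x_new, X, y, k=3):
    # 1. Measure the distance from the new point to every training point.
    distances = np.linalg.norm(X - x_new, axis=1)
    # 2. Decide which points are close: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # 3. Aggregate their target values with a simple average.
    return y[nearest].mean()

print(knn_predict(np.array([2.5]), X_train_demo, y_train_demo, k=3))
# neighbors are x = 2.0, 3.0, 1.0 → (12 + 11 + 10) / 3 = 11.0
```

This is exactly what `KNeighborsRegressor` does internally (with better data structures for the neighbor search).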
What you should expect from the results¶
Before using KNN Regression, it is important to set expectations.
With KNN Regression, you should expect:
- predictions that adapt locally to the data
- good performance when similar data points exist nearby
- sensitivity to noise and outliers
This model often performs well as a flexible alternative when linear assumptions are too restrictive, but it may struggle on large datasets or high-dimensional data.
1. Project setup and common pipeline¶
In this section we prepare everything that is shared across all regression models.
The goal is to:
- set up a clean and reproducible environment
- import all common dependencies
- define a standard pipeline that will not change when switching models
This part of the notebook is intentionally kept simple and consistent. If something changes here, it should change in all models.
Why having a common pipeline matters¶
Using the same pipeline for all models allows us to:
- compare models fairly
- avoid data leakage
- reduce implementation errors
- focus on understanding the model instead of debugging the setup
From this point on, every model will start from the exact same data preparation steps.
# Used for numerical operations and data manipulation.
import numpy as np
import pandas as pd
# Provides a real-world regression dataset for demonstration purposes.
from sklearn.datasets import fetch_california_housing
# Ensures a clean separation between training and test data.
from sklearn.model_selection import train_test_split
# Applies feature scaling to ensure numerical stability and pipeline consistency.
from sklearn.preprocessing import StandardScaler
# Used later to assess model performance.
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Used for portable, OS-independent file paths.
from pathlib import Path
# Used to save trained models and preprocessing objects.
import joblib
____________________________________¶
2. Dataset loading¶
In this section we load the dataset that will be used throughout the notebook.
The purpose of this step is to:
- obtain real data to work with
- clearly separate input features from the target variable
- create a structure that can be easily replaced with custom datasets
At this stage, no modeling is involved. We are only defining what the model will learn from.
About the dataset¶
We use the California Housing dataset, a classic regression dataset.
Each row represents a housing district in California. Each column represents a numerical feature describing that district. The target variable is a continuous value representing house prices.
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
What we obtained¶
X
A table of input features used by the model to make predictions.
y
The target variable we want the model to predict.
This separation is fundamental:
- the model learns patterns from X
- the model is evaluated by comparing predictions to y
Adapting this step to your own data:
- X should contain only feature columns
- y should contain the target variable
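As a hypothetical illustration (the DataFrame, its column names, and the "price" target below are invented for the example):

```python
import pandas as pd

# Invented example data - replace with your own dataset.
df = pd.DataFrame({
    "size_m2": [50, 80, 120],
    "rooms": [2, 3, 4],
    "price": [150_000, 220_000, 310_000],
})

X = df.drop(columns=["price"])  # feature columns only
y = df["price"]                 # the target variable
```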
____________________________________¶
3. Train-test split¶
In this section we split the dataset into two separate parts:
- a training set
- a test set
This step is fundamental in machine learning and applies to almost every model.
Why we split the data¶
The goal of a machine learning model is not to memorize data, but to make good predictions on new, unseen data.
If we trained and evaluated the model on the same data:
- we would get overly optimistic results
- we would not know how well the model generalizes
By splitting the data:
- the training set is used to learn patterns
- the test set is used only for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
What these parameters mean¶
test_size=0.2
20% of the data is reserved for testing,
80% is used for training.
This ratio can be changed depending on the dataset.
random_state=42
Fixes the random shuffling so the split is reproducible across runs.
What we have after this step¶
X_train, y_train
Data used to train the model.
X_test, y_test
Data kept aside and used only for evaluation.
From this point on:
- the model must never see the test data during training
- all learning happens exclusively on the training set
____________________________________¶
4. Feature scaling (why we do it)¶
In this section we apply feature scaling to the input data.
Feature scaling means transforming the numerical features so that they all follow a similar scale.
Why feature scaling is important¶
In many datasets, features can have very different ranges.
For example:
- one feature may range between 0 and 1
- another may range between 0 and 100,000
Without scaling:
- features with larger values can dominate the learning process
- numerical computations may become unstable
Does KNN require scaling?¶
For KNN, scaling is effectively mandatory.
KNN is a distance-based model: if one feature spans a much larger numeric range than the others, it dominates every distance computation and the notion of "nearest" neighbors becomes distorted.
Keeping scaling in the pipeline also preserves consistency: many other models (Ridge, Lasso, SGD, SVM) require scaled features, so keeping this step avoids changing code when switching models.
Important rule: fit only on training data¶
The scaler must be:
- fitted on the training data
- applied to both training and test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
What we have after this step¶
X_train_scaled
Scaled features used to train the model.
X_test_scaled
Scaled features used only for evaluation.
Now we are ready to introduce the model.
____________________________________¶
5. What is this model? (KNN Regression)¶
Before training the model, it is important to understand what KNN Regression is trying to do conceptually.
KNN Regression is a distance-based model. Unlike Linear Regression, it does not learn a global formula that describes the entire dataset.
The core idea¶
KNN Regression makes predictions by comparing data points.
To predict the value for a new input:
- the model looks at the training data
- finds the K closest data points (neighbors)
- computes the prediction as the average of their target values
The idea is that: points that are close to each other are likely to have similar target values.
Why distance matters¶
Distance is the key concept behind KNN.
The notion of "closeness" depends on:
- the feature values
- the distance metric
- proper feature scaling
If features are not scaled:
- distances become meaningless
- the model behaves incorrectly
This is why feature scaling is mandatory for KNN Regression.
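A quick numerical sketch of this effect (the two feature scales below are made up for illustration: one feature on a 0-1 scale, one on a 0-100,000 scale):

```python
import numpy as np

# Two points that differ a lot in the small-scale feature (0.2 vs 0.9)
# and only slightly, relatively speaking, in the large-scale one.
a = np.array([0.2, 1_000.0])
b = np.array([0.9, 1_050.0])

# Without scaling, the large-scale feature dominates the distance:
raw_dist = np.linalg.norm(a - b)
print(raw_dist)  # ~50.0 - the 0.7 gap in the first feature is invisible

# After dividing each feature by its rough scale (what StandardScaler
# does more carefully), both features contribute comparably:
scales = np.array([1.0, 1_000.0])
scaled_dist = np.linalg.norm((a - b) / scales)
print(scaled_dist)  # ~0.70 - now driven by the first feature
```

Without scaling, the "nearest" neighbors would be chosen almost entirely by the large-scale feature.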
Key takeaway¶
KNN Regression does not try to explain the data with a formula.
Instead, it answers the question: "How did similar data points behave in the past?"
____________________________________¶
6. Model training (KNN Regression)¶
In this section we train the KNN Regression model.
For KNN, training does not mean learning parameters or fitting a mathematical function.
Instead, training simply consists of:
- storing the training data
- preparing it to be used for distance-based predictions
What does "training" mean for KNN?¶
Unlike Linear Regression, KNN is a lazy model.
This means:
- no optimization is performed during training
- no coefficients are learned
- no global model is built
All the work happens at prediction time, when distances between data points are computed.
The role of K (number of neighbors)¶
The most important hyperparameter in KNN is K.
K represents:
- how many neighbors are considered when making a prediction
Choosing K involves a trade-off:
- small K → very local, sensitive to noise
- large K → smoother predictions, less flexible
from sklearn.neighbors import KNeighborsRegressor
# Initialize the KNN regressor
knn_model = KNeighborsRegressor(
    n_neighbors=5
)
# "Train" the model
knn_model.fit(X_train_scaled, y_train)
What we have after this step¶
After this step:
- the model has stored the training data
- the value of K has been fixed
- the model is ready to make predictions
The model will compute distances only when predictions are requested.
Important implication¶
Because KNN defers computation to prediction time:
- training is fast
- prediction can be slow on large datasets
- memory usage is higher than parametric models
This behavior will influence when and where KNN should be used.
____________________________________¶
7. Model behavior and key hyperparameters¶
In this section we analyze how the KNN Regression model behaves and which hyperparameters control its predictions.
Unlike Linear Regression, KNN does not produce coefficients. Its behavior is entirely determined by a small set of choices.
The most important hyperparameter: K¶
The value of K determines how many neighbors are used to compute each prediction.
This choice has a strong impact on model behavior:
Small K (e.g. K = 1 or 3):
- predictions are very local
- the model is sensitive to noise
- high variance, low bias
Large K (e.g. K = 20 or more):
- predictions are smoother
- the model is less sensitive to noise
- higher bias, lower variance
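To see this trade-off concretely, here is a sketch that refits the model for several values of K. It uses synthetic data from `make_regression` rather than the housing data so the cell is self-contained; the exact numbers are not meaningful, only the way RMSE changes with K:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data stands in for the scaled training data used above.
X_demo, y_demo = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Refit the model for several values of K and compare test RMSE.
rmse_by_k = {}
for k in [1, 3, 5, 10, 20, 50]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    rmse_by_k[k] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"K={k:>2}  RMSE={rmse_by_k[k]:.2f}")
```

In practice this kind of loop (or `GridSearchCV`) is how K is usually chosen.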
Distance metric¶
KNN relies on a distance metric to define what “close” means.
The choice of distance metric affects:
- neighbor selection
- prediction behavior
- model sensitivity to feature distributions
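The `metric` parameter of `KNeighborsRegressor` controls this choice. A small constructed example (the two training points below are invented so that the nearest-neighbor ranking flips with the metric):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# From the query (0, 0), point (2, 2) is closer in Euclidean distance
# (~2.83 vs 3.0) but farther in Manhattan distance (4.0 vs 3.0).
X_demo = np.array([[3.0, 0.0], [2.0, 2.0]])
y_demo = np.array([10.0, 20.0])

metric_preds = {}
for metric in ["euclidean", "manhattan"]:
    model = KNeighborsRegressor(n_neighbors=1, metric=metric).fit(X_demo, y_demo)
    metric_preds[metric] = model.predict([[0.0, 0.0]])[0]
    print(metric, metric_preds[metric])  # euclidean -> 20.0, manhattan -> 10.0
```

The same query point gets a different neighbor, and therefore a different prediction, under each metric.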
Uniform vs distance-based weighting¶
KNN can weight neighbors in different ways:
Uniform weighting:
- all neighbors contribute equally
- prediction is the simple average
Distance-based weighting:
- closer neighbors have more influence
- farther neighbors contribute less
This choice can significantly affect predictions, especially when neighbors are unevenly distributed.
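The `weights` parameter switches between these modes. A minimal sketch (the three training points are invented so that one neighbor is much closer than the other two):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# One very close neighbor (y=0) and two farther ones (y=30) around x=0.1.
X_demo = np.array([[0.0], [1.0], [1.0]])
y_demo = np.array([0.0, 30.0, 30.0])

uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X_demo, y_demo)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_demo, y_demo)

uniform_pred = uniform.predict([[0.1]])[0]    # (0 + 30 + 30) / 3 = 20.0
weighted_pred = weighted.predict([[0.1]])[0]  # close neighbor dominates, well below 20
print(uniform_pred, weighted_pred)
```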
____________________________________¶
8. Predictions¶
What does a prediction mean for KNN?¶
For KNN Regression, making a prediction means:
- taking a new input sample
- computing its distance to all training samples
- selecting the K closest neighbors
- averaging their target values
# Generate predictions on the test set
y_pred_knn = knn_model.predict(X_test_scaled)
What we obtained¶
- y_pred_knn contains the predicted target values
- each prediction is influenced by nearby training samples
- predictions can vary significantly across the input space
This local behavior is a defining characteristic of KNN.
____________________________________¶
9. Model evaluation¶
Why evaluation is important for KNN¶
KNN can behave very differently depending on:
- the value of K
- the density of the data
- the presence of noise
Evaluation helps us determine whether the chosen configuration is appropriate for the given problem.
# Compute evaluation metrics for KNN Regression
mae = mean_absolute_error(y_test, y_pred_knn)
mse = mean_squared_error(y_test, y_pred_knn)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_knn)
mae, mse, rmse, r2
(0.4461535271317829, 0.4324216146043236, np.float64(0.6575877238850522), 0.6700101862970989)
Metrics interpretation¶
MAE (Mean Absolute Error)
Average absolute prediction error. Easy to interpret and less sensitive to outliers.
MSE (Mean Squared Error)
Penalizes large errors more heavily.
RMSE (Root Mean Squared Error)
Expresses the error in the same units as the target variable.
R² score
Indicates how much variance in the target variable is explained by the model.
Which metric should we focus on when comparing models?¶
When comparing regression models, there is no single metric that is always the “best”.
However, in practice, RMSE is often the most informative metric and the one most commonly used for model comparison.
Why RMSE?
- It is expressed in the same units as the target variable
- It penalizes large errors more than small ones
- It reflects how wrong predictions are in realistic scenarios
MAE is useful for understanding average error, and R² is useful for understanding explained variance, but RMSE usually provides the best overall picture when comparing different models on the same dataset.
For this reason, RMSE is typically the primary metric used to compare regression models in this project.
Interpreting the RMSE value¶
In this case, the RMSE is approximately 0.66.
This means that, on average, the model’s predictions differ from the true values by about 0.66 units of the target variable.
Since the target in this dataset represents house prices in hundreds of thousands of dollars, this corresponds to an average error of roughly $66,000.
It is important to remember that:
- this is an average measure
- some predictions may be much more accurate
- others may have larger errors
____________________________________¶
10. When to use it and when not to (KNN Regression)¶
Knowing when to use KNN Regression is essential to avoid misuse and misleading results.
KNN is a powerful model in the right context, but it also has clear limitations.
When KNN Regression is a good choice¶
KNN Regression works well when:
- The relationship between features and target is non-linear
- Similar data points tend to have similar target values
- The dataset is not extremely large
- Feature scaling is properly applied
- Local patterns are more important than global trends
When KNN Regression is a poor choice¶
Based on the behavior described earlier, KNN tends to struggle when:
- the dataset is very large (prediction becomes slow)
- the data is high-dimensional (distances become less meaningful)
- memory is limited (the full training set must be stored)
- fast predictions are required
____________________________________¶
11. Model persistence¶
Why saving the model is important¶
Even though KNN does not learn parameters in the traditional sense, saving the model is still essential.
The saved model contains:
- the training data
- the chosen hyperparameters (such as K)
- the configuration needed to reproduce predictions
This allows the model to be reused consistently.
Important rule: save the scaler together with the model¶
KNN relies on distance calculations.
For this reason:
- new input data must be scaled in the same way
- using a different scaler would lead to incorrect distances
The model and the scaler must always be saved and loaded together.
from pathlib import Path
import joblib
# Define model directory
model_dir = Path("models/supervised_learning/regression/knn_regression")
# Create directory if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)
# Save model and scaler
joblib.dump(knn_model, model_dir / "knn_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")
['models\\supervised_learning\\regression\\knn_regression\\scaler.joblib']
What we have now¶
- A trained KNN Regression model
- A fitted feature scaler
- Both saved and ready to be reused
At this point, the model can be:
- loaded in another notebook
- used in an application
- evaluated on new data
____________________________________¶
Loading the model later (conceptual example)¶
To use the model in the future:
- load the scaler
- scale new input data
- load the KNN model
- generate predictions
This ensures full consistency with the original training pipeline.
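A minimal, self-contained sketch of this load-and-predict flow. Since the real files live under `models/...`, this sketch first saves a throwaway model and scaler to a temporary directory; only the four lines after the marker are the part you would reuse:

```python
import tempfile
from pathlib import Path
import numpy as np
import joblib
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Stand-in for the artifacts saved earlier: a tiny fitted scaler + model.
X_fit = np.array([[1.0], [2.0], [3.0]])
y_fit = np.array([10.0, 20.0, 30.0])
tmp = Path(tempfile.mkdtemp())
scaler_fit = StandardScaler().fit(X_fit)
joblib.dump(KNeighborsRegressor(n_neighbors=1).fit(scaler_fit.transform(X_fit), y_fit),
            tmp / "knn_regression_model.joblib")
joblib.dump(scaler_fit, tmp / "scaler.joblib")

# --- Loading later: the part you would reuse ---
scaler = joblib.load(tmp / "scaler.joblib")                   # 1. load the scaler
X_new_scaled = scaler.transform(np.array([[2.1]]))            # 2. scale new input data
knn_model = joblib.load(tmp / "knn_regression_model.joblib")  # 3. load the KNN model
predictions = knn_model.predict(X_new_scaled)                 # 4. generate predictions
print(predictions)  # nearest training point is x=2.0 → [20.]
```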
____________________________________¶
12. Mathematical formulation (deep dive)¶
Representation of the data¶
KNN Regression operates directly on the training data.
Each training sample can be represented as:
- a feature vector xᵢ
- a corresponding target value yᵢ
The full training set is stored and used during prediction.
Distance computation¶
To make a prediction for a new input x, the model computes the distance between x and every training sample xᵢ.
By default, this distance is the Euclidean distance.
This step determines which samples are considered the "nearest neighbors".
Neighbor selection and prediction¶
Once distances are computed:
- the K closest samples are selected
- their target values are retrieved
In KNN Regression, the prediction is computed as:
- the average of the target values of the selected neighbors
Optionally:
- closer neighbors can be given more weight
- farther neighbors can contribute less
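The steps above can be written compactly. With d features, D(x, xᵢ) the Euclidean distance, and N_K(x) the index set of the K nearest neighbors of x:

```latex
D(x, x_i) = \sqrt{\sum_{j=1}^{d} (x_j - x_{ij})^2}

\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i
\qquad \text{(uniform weighting)}

\hat{y}(x) = \frac{\sum_{i \in N_K(x)} w_i \, y_i}{\sum_{i \in N_K(x)} w_i},
\quad w_i = \frac{1}{D(x, x_i)}
\qquad \text{(distance weighting)}
```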
Final takeaway¶
KNN Regression is conceptually simple but powerful.
It replaces a mathematical model with direct comparison between data points.
Understanding this mechanism explains:
- why scaling is mandatory
- why K strongly affects performance
- why KNN behaves very differently from linear models
____________________________________¶
Final summary – Code only¶
The following cell contains the complete pipeline from data loading to model persistence.
No explanations are provided here on purpose.
This section is intended for:
- quick execution
- reference
- reuse in scripts or applications
If you want to understand what each step does and why, read the notebook from top to bottom.
# ====================================
# Imports
# ====================================
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from pathlib import Path
import joblib
# ====================================
# Dataset loading
# ====================================
data = fetch_california_housing(as_frame=True)
X = data.data
y = data.target
# ====================================
# Train-test split
# ====================================
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
# ====================================
# Feature scaling (mandatory for KNN)
# ====================================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ====================================
# Model initialization
# ====================================
knn_model = KNeighborsRegressor(
    n_neighbors=5
)
# ====================================
# Model training
# ====================================
knn_model.fit(X_train_scaled, y_train)
# ====================================
# Predictions
# ====================================
y_pred_knn = knn_model.predict(X_test_scaled)
# ====================================
# Model evaluation
# ====================================
mae = mean_absolute_error(y_test, y_pred_knn)
mse = mean_squared_error(y_test, y_pred_knn)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_knn)
mae, mse, rmse, r2
# ====================================
# Model persistence
# ====================================
model_dir = Path("models/supervised_learning/regression/knn_regression")
model_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(knn_model, model_dir / "knn_regression_model.joblib")
joblib.dump(scaler, model_dir / "scaler.joblib")