In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
confusion_matrix,
classification_report,
accuracy_score
)
from sklearn.preprocessing import StandardScaler
In [2]:
df = pd.read_csv("../data/classification.csv")
df.head()
Out[2]:
|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.189853 | -2.853724 | -0.963144 | 0.318181 | 1.284353 | -1.228444 | -1.074027 | -1.831145 | 0 |
| 1 | 0.999778 | 1.436418 | 0.482381 | -1.586100 | -0.634295 | -0.393530 | 2.602929 | -1.119546 | 1 |
| 2 | 2.832780 | -0.761251 | -2.159733 | -0.235862 | -0.383240 | -0.322028 | -1.734361 | 1.614779 | 0 |
| 3 | -1.937572 | 0.665805 | -1.500188 | 2.501383 | 2.006982 | 0.359476 | 1.857416 | -0.478759 | 1 |
| 4 | -1.040299 | -1.983650 | -4.728988 | 4.836235 | 3.758709 | -1.968504 | -2.065338 | 0.034714 | 0 |
Dataset and Feature Scaling¶
KNN is a distance-based algorithm. Therefore, feature scaling is essential to ensure that all features contribute equally to distance calculations.
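To see why, consider two hypothetical points that differ by 1 unit in one feature and 100 units in another: without scaling, the large-scale feature dominates the Euclidean distance almost entirely.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

a = np.array([0.0, 0.0])
b = np.array([1.0, 100.0])

# Unscaled: the distance is essentially the feature-1 gap alone.
print(np.linalg.norm(a - b))  # ≈ 100.005

# After standardizing the two points as a tiny dataset, both
# features contribute equally to the distance.
scaled = StandardScaler().fit_transform(np.vstack([a, b]))
print(np.linalg.norm(scaled[0] - scaled[1]))  # ≈ 2.828 (= 2·√2)
```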
In [3]:
X = df.drop("target", axis=1)
y = df["target"]
In [4]:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
In [5]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Model Definition¶
KNN does not learn an explicit model. Instead, it:
- stores the training data
- makes predictions based on the closest neighbors
The parameter k controls how many neighbors are considered.
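The two steps above can be sketched from scratch; this is a minimal, illustrative version (toy points, not the notebook's dataset), not what scikit-learn does internally:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Minimal KNN classifier: no training step, just a distance lookup."""
    # Euclidean distance from x to every stored training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two clusters around (0, 0) and (5, 5).
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))  # → 0
```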
In [6]:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)
Out[6]:
KNeighborsClassifier()
Prediction¶
Predictions are made by majority vote among the nearest neighbors.
In [7]:
y_pred = model.predict(X_test_scaled)
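The vote behind a single prediction can be inspected with `kneighbors` and `predict_proba`. A self-contained sketch on toy data (standing in for the scaled training set):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two clusters around (0, 0) and (5, 5).
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

query = np.array([[4.5, 5.0]])
dist, idx = knn.kneighbors(query)   # distances and indices of the 3 neighbors
print(y_demo[idx[0]])               # labels that cast the vote → [1 1 1]
print(knn.predict_proba(query))     # vote share per class → [[0. 1.]]
print(knn.predict(query))           # majority vote → [1]
```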
Evaluation¶
As a classification model, KNN is evaluated using:
- confusion matrix
- accuracy
- precision, recall and F1-score
In [8]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print()
print(confusion_matrix(y_test, y_pred))
print()
print(classification_report(y_test, y_pred))
Accuracy: 0.925
[[83 0]
[ 9 28]]
              precision    recall  f1-score   support

           0       0.90      1.00      0.95        83
           1       1.00      0.76      0.86        37

    accuracy                           0.93       120
   macro avg       0.95      0.88      0.91       120
weighted avg       0.93      0.93      0.92       120
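The reported numbers follow directly from the confusion matrix; a quick consistency check for class 1:

```python
# Entries read off the confusion matrix above:
tn, fp, fn, tp = 83, 0, 9, 28

accuracy = (tn + tp) / (tn + fp + fn + tp)   # 111 / 120 = 0.925
precision_1 = tp / (tp + fp)                 # 28 / 28 = 1.00
recall_1 = tp / (tp + fn)                    # 28 / 37 ≈ 0.76
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(accuracy)            # 0.925
print(round(recall_1, 2))  # 0.76
print(round(f1_1, 2))      # 0.86
```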
Interpretation¶
Key characteristics of KNN:
- non-parametric model
- sensitive to feature scaling
- prediction time increases with dataset size
- performance depends heavily on the choice of k
Effect of the Parameter k¶
- small k → sensitive to noise (overfitting)
- large k → smoother decision boundary (underfitting)
Choosing k is a trade-off.
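A common way to resolve this trade-off is to score several values of k with cross-validation. A sketch on a synthetic stand-in for the notebook's dataset (the real one is loaded from a CSV); note that scaling lives inside the pipeline so each fold is standardized on its own training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the notebook's dataset.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

scores = {}
for k in [1, 3, 5, 7, 9, 15, 25]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```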