In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
confusion_matrix,
classification_report,
accuracy_score
)
In [2]:
df = pd.read_csv("../data/classification.csv")
df.head()
Out[2]:
|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.189853 | -2.853724 | -0.963144 | 0.318181 | 1.284353 | -1.228444 | -1.074027 | -1.831145 | 0 |
| 1 | 0.999778 | 1.436418 | 0.482381 | -1.586100 | -0.634295 | -0.393530 | 2.602929 | -1.119546 | 1 |
| 2 | 2.832780 | -0.761251 | -2.159733 | -0.235862 | -0.383240 | -0.322028 | -1.734361 | 1.614779 | 0 |
| 3 | -1.937572 | 0.665805 | -1.500188 | 2.501383 | 2.006982 | 0.359476 | 1.857416 | -0.478759 | 1 |
| 4 | -1.040299 | -1.983650 | -4.728988 | 4.836235 | 3.758709 | -1.968504 | -2.065338 | 0.034714 | 0 |
Dataset¶
The dataset contains:
- eight numerical features
- a binary target variable (0 or 1)
Because this is a classification problem:
- each data point belongs to a discrete class
- evaluation is based on the confusion matrix and the metrics derived from it
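Before modeling, it is worth checking how balanced the two classes are, since accuracy alone can be misleading on skewed data. A minimal sketch (using a small hypothetical stand-in for `df`, since the real CSV is not shown here):

```python
import pandas as pd

# hypothetical stand-in for the notebook's df
df_demo = pd.DataFrame({"target": [0, 0, 1, 0, 1, 0]})

# fraction of samples in each class; on the real df this reveals any imbalance
print(df_demo["target"].value_counts(normalize=True))
```

On the real dataset the same call on `df["target"]` shows the class proportions directly.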
In [3]:
X = df.drop("target", axis=1)
y = df["target"]
In [4]:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape
Out[4]:
((480, 8), (120, 8))
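The `stratify=y` argument ensures the train and test sets keep the same class proportions as the full dataset. A small sketch with toy data (the 80/20 class mix below is hypothetical) makes this visible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data: 80 negatives, 20 positives (hypothetical mix)
y_demo = pd.Series([0] * 80 + [1] * 20)
X_demo = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# both splits preserve the 20% positive rate of the full series
print(y_tr.mean(), y_te.mean())
```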
Model Training¶
Logistic Regression is a linear model that outputs probabilities. A decision threshold (0.5 by default) is then applied to assign class labels.
In [5]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Out[5]:
LogisticRegression(max_iter=1000)
In [6]:
y_pred = model.predict(X_test)
In [7]:
y_prob = model.predict_proba(X_test)
y_prob[:5]
Out[7]:
array([[0.88582787, 0.11417213],
[0.87867968, 0.12132032],
[0.89042122, 0.10957878],
[0.94380779, 0.05619221],
[0.93572647, 0.06427353]])
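Each row of `predict_proba` holds the probabilities for class 0 and class 1. The default 0.5 threshold that `predict` applies can be made explicit, which is useful when a different cutoff is needed. A sketch using a few hypothetical probability rows:

```python
import numpy as np

# hypothetical predict_proba output: columns are P(class 0), P(class 1)
probs = np.array([[0.886, 0.114],
                  [0.879, 0.121],
                  [0.056, 0.944]])

# thresholding P(class 1) at 0.5 reproduces predict()'s labels
labels = (probs[:, 1] >= 0.5).astype(int)
print(labels)  # [0 0 1]
```

Lowering the threshold trades precision for recall on the positive class.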
Confusion Matrix¶
The confusion matrix summarizes:
- correct predictions
- classification errors
In scikit-learn, rows correspond to true classes and columns to predicted classes. The matrix is the basis for the classification metrics below.
In [8]:
cm = confusion_matrix(y_test, y_pred)
cm
Out[8]:
array([[80, 3],
[10, 27]])
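For a binary problem the four cells can be unpacked by name, following scikit-learn's convention (rows = true class, columns = predicted class). A sketch using the counts from the output above:

```python
import numpy as np

cm = np.array([[80, 3],
               [10, 27]])  # rows = true class, columns = predicted class

# ravel flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 80 3 10 27
```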
Evaluation Metrics¶
For classification tasks we commonly use:
- Accuracy
- Precision
- Recall
- F1-score
Each metric captures a different aspect of model performance.
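All four metrics follow directly from the confusion-matrix counts. Computing them by hand from the Out[8] values (for class 1 as the positive class) reproduces the figures in the classification report below:

```python
# counts taken from the confusion matrix above
tp, fp, fn, tn = 27, 3, 10, 80

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all correct predictions
precision = tp / (tp + fp)                    # how reliable positive predictions are
recall    = tp / (tp + fn)                    # how many true positives were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.89 0.9 0.73 0.81
```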
In [9]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print()
print(classification_report(y_test, y_pred))
Accuracy: 0.8916666666666667
precision recall f1-score support
0 0.89 0.96 0.92 83
1 0.90 0.73 0.81 37
accuracy 0.89 120
macro avg 0.89 0.85 0.87 120
weighted avg 0.89 0.89 0.89 120
Interpretation¶
- Accuracy measures overall correctness
- Precision measures reliability of positive predictions
- Recall measures how many positives are correctly identified
- F1-score balances precision and recall
Logistic Regression is not evaluated with Mean Squared Error: it is trained by minimizing log loss (binary cross-entropy), and because the output is a class label rather than a continuous value, regression metrics such as MSE do not apply.
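The log-loss objective can be written out on a few hypothetical probabilities to see what the model actually minimizes during training:

```python
import numpy as np

# hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 1, 1, 0])
p      = np.array([0.1, 0.8, 0.6, 0.3])

# binary cross-entropy: penalizes confident wrong predictions heavily
loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(round(loss, 3))  # 0.299
```

`sklearn.metrics.log_loss(y_true, p)` computes the same quantity.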