
Logistic Regression¶

Objective¶

The goal of this notebook is to perform binary classification using Logistic Regression.

This notebook demonstrates:

  • probability-based classification
  • decision threshold
  • confusion matrix
  • precision, recall and F1-score

Problem type: Classification

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score
)
In [2]:
df = pd.read_csv("../data/classification.csv")
df.head()
Out[2]:
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 target
0 0.189853 -2.853724 -0.963144 0.318181 1.284353 -1.228444 -1.074027 -1.831145 0
1 0.999778 1.436418 0.482381 -1.586100 -0.634295 -0.393530 2.602929 -1.119546 1
2 2.832780 -0.761251 -2.159733 -0.235862 -0.383240 -0.322028 -1.734361 1.614779 0
3 -1.937572 0.665805 -1.500188 2.501383 2.006982 0.359476 1.857416 -0.478759 1
4 -1.040299 -1.983650 -4.728988 4.836235 3.758709 -1.968504 -2.065338 0.034714 0

Dataset¶

The dataset contains:

  • numerical features
  • a binary target variable (0 or 1)

Because this is a classification problem:

  • data points belong to discrete classes
  • evaluation is based on the confusion matrix and the metrics derived from it
In [3]:
X = df.drop("target", axis=1)
y = df["target"]
In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape
Out[4]:
((480, 8), (120, 8))
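
Because stratify=y was passed, both splits should preserve the 0/1 class proportions of the full dataset. A quick illustrative check (not part of the original notebook):

# The normalized class counts of train and test should be (almost) identical.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))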

Model Training¶

Logistic Regression is a linear model that outputs probabilities. A decision threshold is then applied to assign class labels.
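
For intuition, here is a minimal sketch of the sigmoid function, which maps the model's linear score (a weighted sum of the features plus an intercept) to a probability between 0 and 1. The sigmoid helper below is illustrative only and reuses the np import from above:

def sigmoid(z):
    # Maps any real-valued score z to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx. [0.018, 0.5, 0.982]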

In [5]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Out[5]:
LogisticRegression(max_iter=1000)
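
Because the model is linear, its learned parameters can be inspected directly. A brief illustrative check:

# One weight per feature plus an intercept term; the sign of a weight
# indicates whether the feature pushes predictions toward class 1 or class 0.
print(model.coef_.shape)   # (1, 8) for binary classification with 8 features
print(model.intercept_)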

Prediction¶

We generate:

  • class predictions
  • predicted probabilities
In [6]:
y_pred = model.predict(X_test)
In [7]:
y_prob = model.predict_proba(X_test)

y_prob[:5]
Out[7]:
array([[0.88582787, 0.11417213],
       [0.87867968, 0.12132032],
       [0.89042122, 0.10957878],
       [0.94380779, 0.05619221],
       [0.93572647, 0.06427353]])
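
By default, predict applies a 0.5 threshold to the positive-class probability (the second column above). As a sketch, the class labels can be recovered from the probabilities manually:

# Thresholding the class-1 probabilities at 0.5 should agree with
# model.predict (ties at exactly 0.5 aside).
y_pred_manual = (y_prob[:, 1] >= 0.5).astype(int)
print((y_pred_manual == y_pred).all())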

Confusion Matrix¶

The confusion matrix summarizes:

  • correct predictions
  • classification errors

It is the basis for the classification metrics reported below.

In [8]:
cm = confusion_matrix(y_test, y_pred)
cm
Out[8]:
array([[80,  3],
       [10, 27]])
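
As a sketch of how the metrics in the next section derive from these four counts (in scikit-learn, rows are true labels and columns are predictions):

# Unpack the 2x2 matrix: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)                          # reliability of positive predictions
recall = tp / (tp + fn)                             # fraction of actual positives found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)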

Evaluation Metrics¶

For classification tasks we commonly use:

  • Accuracy
  • Precision
  • Recall
  • F1-score

Each metric captures a different aspect of model performance.

In [9]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print()
print(classification_report(y_test, y_pred))
Accuracy: 0.8916666666666667

              precision    recall  f1-score   support

           0       0.89      0.96      0.92        83
           1       0.90      0.73      0.81        37

    accuracy                           0.89       120
   macro avg       0.89      0.85      0.87       120
weighted avg       0.89      0.89      0.89       120

Interpretation¶

  • Accuracy measures overall correctness
  • Precision measures reliability of positive predictions
  • Recall measures how many positives are correctly identified
  • F1-score balances precision and recall

Unlike regression models, Logistic Regression is not evaluated with Mean Squared Error: it is trained by minimizing log loss (binary cross-entropy), and its predictions are assessed with the classification metrics above, because the output is a class label rather than a continuous value.
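
The 0.5 threshold is itself a tunable choice: lowering it typically raises recall for class 1 at the cost of precision, and raising it does the opposite. A hedged sketch reusing y_prob and y_test from above:

from sklearn.metrics import precision_score, recall_score

# Sweep a few thresholds and watch precision and recall trade off.
for threshold in (0.3, 0.5, 0.7):
    preds = (y_prob[:, 1] >= threshold).astype(int)
    print(threshold, precision_score(y_test, preds), recall_score(y_test, preds))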