.. _cross_validation:

================
Cross validation
================

.. currentmodule:: imblearn.model_selection


.. _instance_hardness_threshold_cv:

The term *instance hardness* is used in the literature to express how difficult it is
to classify an instance correctly. An instance for which the predicted probability of
the true class is low has large instance hardness. How these hard-to-classify
instances are distributed over the train and test sets in cross validation has a
significant effect on the test set performance metrics. The
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter distributes samples
with large instance hardness evenly over the folds, resulting in a more robust cross
validation.

We will discuss instance hardness in this document and explain how to use the
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter.

Instance hardness and average precision
=======================================

Instance hardness is defined as 1 minus the probability of the most probable class:

.. math::

   H(x) = 1 - P(\hat{y}|x)

In this equation :math:`H(x)` is the instance hardness of a sample with features
:math:`x` and :math:`P(\hat{y}|x)` the probability of the predicted label
:math:`\hat{y}` given those features. If the model predicts label 0 and gives a
`predict_proba` output of `[0.9, 0.1]`, the probability of the most probable class
(0) is 0.9 and the instance hardness is `1 - 0.9 = 0.1`.

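This quantity is straightforward to compute from any classifier that exposes
`predict_proba`. A minimal sketch, where the dataset and model are arbitrary choices
for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit any probabilistic classifier; the instance hardness of each sample
# is then 1 minus the probability of its most probable class.
X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)
hardness = 1 - clf.predict_proba(X).max(axis=1)

print(hardness.shape)  # one hardness value per sample: (100,)
```

In the binary case the most probable class always has probability at least 0.5, so
the hardness values lie between 0 and 0.5.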
Samples with large instance hardness have a significant effect on the area under the
precision-recall curve, or average precision. In particular, samples with label 0 and
large instance hardness (i.e. the model predicts label 1) lower the average precision
considerably, as these points affect the left side of the precision-recall curve,
where the area is largest: the precision is lowered in the range of low recall and
high thresholds. When doing cross validation, for instance for hyperparameter tuning
or recursive feature elimination, a random concentration of these points in some
folds introduces variance in the CV results that deteriorates the robustness of the
cross validation task. The :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter aims to distribute the samples with large instance hardness over the folds
in order to reduce this undesired variance. Note that this splitter should be used to
make model *selection* tasks robust, such as hyperparameter tuning and feature
selection, but not for model *performance estimation*, for which you also want to
know the variance of the performance to be expected in production.

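The effect of a single hard negative on average precision can be seen with a small toy
ranking (the scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Eight easy negatives and two confident positives: a perfect ranking.
y_true = np.array([0] * 8 + [1, 1])
y_score = np.array([0.1] * 8 + [0.8, 0.9])
ap_clean = average_precision_score(y_true, y_score)

# One hard negative that outranks every positive lowers the precision
# at low recall, where the precision-recall curve contributes most area.
y_true_hard = np.append(y_true, 0)
y_score_hard = np.append(y_score, 0.95)
ap_hard = average_precision_score(y_true_hard, y_score_hard)

print(ap_clean, ap_hard)  # 1.0 versus roughly 0.58
```

A single such sample drops the average precision from 1.0 to about 0.58, which is why
their placement across folds matters so much.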
Create imbalanced dataset with samples with large instance hardness
===================================================================

Let's start by creating a dataset to work with. We create a dataset with 5% class
imbalance using scikit-learn's :func:`~sklearn.datasets.make_blobs` function.

 >>> import numpy as np
 >>> from matplotlib import pyplot as plt
 >>> from sklearn.datasets import make_blobs
 >>> random_state = 10
 >>> X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)),
 ...                   random_state=random_state)
 >>> plt.scatter(X[:, 0], X[:, 1], c=y)
 >>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Now we add some samples with large instance hardness:

 >>> X_hard, y_hard = make_blobs(n_samples=10, centers=((3, 0), (-3, 0)),
 ...                             cluster_std=1,
 ...                             random_state=random_state)
 >>> X = np.vstack((X, X_hard))
 >>> y = np.hstack((y, y_hard))
 >>> plt.scatter(X[:, 0], X[:, 1], c=y)
 >>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Assess cross validation performance variance using `InstanceHardnessCV` splitter
================================================================================

Now we take a :class:`~sklearn.linear_model.LogisticRegression` and assess the
cross validation performance using a :class:`~sklearn.model_selection.StratifiedKFold`
cv splitter and the :func:`~sklearn.model_selection.cross_validate` function.

 >>> from sklearn.linear_model import LogisticRegression
 >>> from sklearn.model_selection import StratifiedKFold, cross_validate
 >>> clf = LogisticRegression(random_state=random_state)
 >>> skf_cv = StratifiedKFold(n_splits=5, shuffle=True,
 ...                          random_state=random_state)
 >>> skf_result = cross_validate(clf, X, y, cv=skf_cv, scoring="average_precision")

Now, we do the same using an :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter. We provide our classifier to the splitter to calculate instance hardness
and distribute samples with large instance hardness equally over the folds.

 >>> from imblearn.model_selection import InstanceHardnessCV
 >>> ih_cv = InstanceHardnessCV(estimator=clf, n_splits=5,
 ...                            random_state=random_state)
 >>> ih_result = cross_validate(clf, X, y, cv=ih_cv, scoring="average_precision")

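The idea behind the splitter can be approximated with plain scikit-learn. The sketch
below is an illustration of the strategy, not the splitter's actual implementation:
it estimates hardness from out-of-fold probabilities of the true class (definitions
of hardness vary in the literature) and deals samples over the folds in order of
decreasing hardness. All names and parameters here are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=0)

# Estimate hardness from out-of-fold predicted probabilities of the
# true class, so the estimate is not biased by training on the sample.
proba = cross_val_predict(LogisticRegression(), X, y,
                          method="predict_proba", cv=5)
hardness = 1 - proba[np.arange(len(y)), y]

# Deal samples over the folds in order of decreasing hardness, so each
# test fold receives a similar share of hard samples.
n_splits = 5
order = np.argsort(-hardness)
fold_of = np.empty(len(y), dtype=int)
fold_of[order] = np.arange(len(y)) % n_splits

test_folds = [np.flatnonzero(fold_of == k) for k in range(n_splits)]
print([len(fold) for fold in test_folds])  # [20, 20, 20, 20, 20]
```

Each fold then contains roughly the same number of hard samples, which is what keeps
the per-fold average precision scores close together.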
When we plot the test scores for both cv splitters, we see that the variance when
using the :class:`~imblearn.model_selection.InstanceHardnessCV` splitter is lower
than for the :class:`~sklearn.model_selection.StratifiedKFold` splitter.

 >>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],
 ...             tick_labels=["StratifiedKFold", "InstanceHardnessCV"],
 ...             vert=False)
 >>> plt.xlabel('Average precision')
 >>> plt.tight_layout()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_003.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Be aware that the most important property of a cross-validation splitter is that it
simulates the conditions one will encounter in production. Therefore, if difficult
samples are likely to occur in production, one should use a cross-validation splitter
that emulates this situation. In our case, the
:class:`~sklearn.model_selection.StratifiedKFold` splitter did not distribute the
difficult samples evenly over the folds, which made it a poor fit for our use case.