Examples: Text sentiment (LinearSVC) with class rebalancing on tweet_eval #1150
Summary
Add a runnable example demonstrating 3-class text sentiment classification (negative/neutral/positive) with an imbalanced-learn pipeline:
TfidfVectorizer → RandomUnderSampler → LinearSVC
The example uses the tweet_eval/sentiment dataset and shows how to handle class imbalance in sparse text workflows. It prints a balanced-accuracy score and a classification report and can optionally save a confusion-matrix image.
Files added
examples/text_sentiment_svm_with_resampling.py
imblearn/tests/test_text_sentiment_example.py (small smoke test)
Motivation
Many real-world text datasets are imbalanced and represented as sparse TF-IDF features. Interpolation-based over-samplers such as SMOTE are a poor fit for this setting (synthetic points interpolated in a high-dimensional sparse TF-IDF space rarely correspond to meaningful documents), whereas under-sampling works seamlessly on sparse input. This example gives users a concise, reproducible template for building and evaluating an imbalance-aware text pipeline with scikit-learn + imbalanced-learn.
Usage
Install optional deps:
pip install datasets matplotlib
Run the example (saves a confusion matrix PNG when --plot is used):
python examples/text_sentiment_svm_with_resampling.py --plot --max-samples 6000
Key outputs:
Balanced accuracy (printed)
Classification report (printed)
Confusion matrix image: confmat_svm_imblearn.png (when --plot is passed)
Notes:
--max-samples keeps runtime/disk reasonable; set None to use the full dataset.
The dataset labels are 0=negative, 1=neutral, 2=positive.
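One way the CLI flags above could be wired is sketched below; the flag names match the usage shown, but the parsing helper is a hypothetical illustration, not necessarily the example's exact implementation.

```python
import argparse

def max_samples_type(value):
    # Allow "--max-samples None" to disable subsampling; otherwise an int cap.
    return None if value.lower() == "none" else int(value)

parser = argparse.ArgumentParser(description="Sentiment SVM with under-sampling")
parser.add_argument("--plot", action="store_true",
                    help="save a confusion-matrix PNG")
parser.add_argument("--max-samples", type=max_samples_type, default=6000,
                    help="cap on training samples; 'None' uses the full dataset")

args = parser.parse_args(["--plot", "--max-samples", "6000"])
print(args.plot, args.max_samples)  # True 6000
print(parser.parse_args(["--max-samples", "None"]).max_samples)  # None
```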
Tests
A small smoke test is included and can be run as:
pytest -q imblearn/tests/test_text_sentiment_example.py
The test:
Trains on a tiny slice of the dataset,
Verifies that predictions are produced and fall within the expected label set,
Uses pytest.importorskip("datasets") so it’s skipped if the optional dependency isn’t installed.
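The smoke test has roughly this shape. The toy fixture below is a stand-in for the tiny tweet_eval slice, so only the structure is illustrated, not the test's exact body.

```python
import pytest

def test_text_sentiment_example_smoke():
    pytest.importorskip("datasets")  # skip cleanly if the optional dep is absent
    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Tiny stand-in fixture; the real test loads a few tweet_eval rows instead.
    texts = ["bad awful", "sad terrible", "fine okay", "meh neutral",
             "great love", "wonderful best"]
    labels = [0, 0, 1, 1, 2, 2]

    clf = make_pipeline(TfidfVectorizer(),
                        RandomUnderSampler(random_state=0),
                        LinearSVC())
    clf.fit(texts, labels)
    preds = clf.predict(texts)
    assert set(preds.tolist()) <= {0, 1, 2}
```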
Implementation Notes
Chooses RandomUnderSampler because TF-IDF features are sparse and high-dimensional; SMOTE-style interpolation rarely produces meaningful synthetic documents in that space, while random under-sampling operates on sparse matrices without issue.
Uses LinearSVC for a strong, fast baseline on high-dimensional sparse text.
Reports balanced accuracy and macro F1 to reflect class imbalance better than plain accuracy.
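To illustrate why balanced accuracy and macro F1 are reported, here is a small worked example with toy labels (not actual results from the dataset): a predictor that only ever emits the majority class still scores high plain accuracy, but its balanced accuracy collapses to the mean per-class recall.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels: class 0 dominates; the "model" predicts the majority class only.
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))           # 0.75 -- looks decent
print(balanced_accuracy_score(y_true, y_pred))  # ~0.333 -- mean per-class recall
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```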
Backward Compatibility
No changes to public APIs; example + test only.
Checklist
Example runs locally and produces metrics and (optionally) a confusion matrix PNG
Added minimal smoke test; passes locally
Code style follows project conventions (simple, documented, reproducible)
No public API changes