Summary

Add a runnable example demonstrating 3-class text sentiment classification (negative/neutral/positive) with an imbalanced-learn pipeline:

TfidfVectorizer → RandomUnderSampler → LinearSVC

The example uses the tweet_eval/sentiment dataset and shows how to handle class imbalance in sparse text workflows. It prints a balanced-accuracy score and a classification report, and can optionally save a confusion-matrix image.
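
For orientation, here is a minimal sketch of the pipeline the example builds (the vectorizer settings shown are illustrative, not necessarily the exact values used in the script):

```python
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# TF-IDF keeps the features sparse; imblearn's Pipeline applies the sampler
# only during fit, so the test data is never resampled.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # illustrative settings
    RandomUnderSampler(random_state=42),
    LinearSVC(),
)
pipe.fit(train_texts, train_labels)  # raw strings and integer labels 0/1/2
pred = pipe.predict(test_texts)      # compare against y_test, the held-out labels
```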

Files added

examples/text_sentiment_svm_with_resampling.py

imblearn/tests/test_text_sentiment_example.py (small smoke test)

Motivation

Many real-world text datasets are imbalanced and represented as sparse TF-IDF features. Interpolation-based over-samplers such as SMOTE are an awkward fit for sparse, high-dimensional text features, whereas random under-sampling works seamlessly. This example gives users a concise, reproducible template for building and evaluating an imbalance-aware text pipeline with scikit-learn + imbalanced-learn.

Usage

Install optional deps:

pip install datasets matplotlib

Run the example (saves a confusion matrix PNG when --plot is used):

python examples/text_sentiment_svm_with_resampling.py --plot --max-samples 6000

Key outputs:

Balanced accuracy (printed)

Classification report (printed)

Confusion matrix image: confmat_svm_imblearn.png (when --plot is passed)
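
The plot can be produced with scikit-learn's ConfusionMatrixDisplay; a sketch (the script's plotting code may differ, and y_test / pred are placeholders for the fitted pipeline's held-out labels and predictions):

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Label names for display only; the dataset itself stores integer labels.
disp = ConfusionMatrixDisplay.from_predictions(
    y_test, pred, display_labels=["negative", "neutral", "positive"]
)
disp.figure_.savefig("confmat_svm_imblearn.png", dpi=150, bbox_inches="tight")
```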

Notes:

--max-samples keeps runtime and disk usage reasonable; pass None to use the full dataset.

The dataset labels are 0=negative, 1=neutral, 2=positive.
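
For context, loading and truncating the data might look like the following (a sketch; the script's actual handling of --max-samples may differ):

```python
from datasets import load_dataset

max_samples = 6000  # mirrors --max-samples; use None for the full split
train = load_dataset("tweet_eval", "sentiment", split="train")
if max_samples is not None:
    train = train.shuffle(seed=42).select(range(min(max_samples, len(train))))
train_texts, train_labels = train["text"], train["label"]  # labels: 0/1/2
```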

Tests

A small smoke test is included and can be run as:

pytest -q imblearn/tests/test_text_sentiment_example.py

The test:

Trains on a tiny slice of the dataset,

Verifies predictions are produced with the expected label set,

Uses pytest.importorskip("datasets") so it’s skipped if the optional dependency isn’t installed.
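
Roughly, the test follows this shape (a sketch, not the committed file verbatim):

```python
import pytest


def test_text_sentiment_pipeline_smoke():
    datasets = pytest.importorskip("datasets")  # skip when the optional dep is missing

    from imblearn.pipeline import make_pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # A tiny, shuffled slice keeps the test fast while making it likely that
    # all three classes are present for the under-sampler.
    ds = datasets.load_dataset("tweet_eval", "sentiment", split="train")
    ds = ds.shuffle(seed=0).select(range(300))
    texts, labels = ds["text"], ds["label"]

    pipe = make_pipeline(TfidfVectorizer(), RandomUnderSampler(random_state=0), LinearSVC())
    preds = pipe.fit(texts, labels).predict(texts)

    assert set(preds) <= {0, 1, 2}  # predictions use the expected label set
```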

Implementation Notes

Chooses RandomUnderSampler because TF-IDF features are sparse and SMOTE-style interpolation is better suited to dense, continuous features.

Uses LinearSVC for a strong, fast baseline on high-dimensional sparse text.

Reports balanced accuracy and macro F1 to reflect class imbalance better than plain accuracy.
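
The reporting boils down to a couple of scikit-learn calls (variable names are placeholders for the script's own):

```python
from sklearn.metrics import balanced_accuracy_score, classification_report, f1_score

print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))
print(classification_report(y_test, pred, target_names=["negative", "neutral", "positive"]))
```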

Backward Compatibility

No changes to public APIs; example + test only.

Checklist

Example runs locally and produces metrics and (optionally) a confusion matrix PNG

Added minimal smoke test; passes locally

Code style follows project conventions (simple, documented, reproducible)

No public API changes
