I have a dataset that contains both numeric and categorical variables, and I want to combine over-sampling and under-sampling. SMOTETomek is only applicable to purely numeric datasets, so I chained SMOTENC and TomekLinks in a pipeline instead:
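from imblearn.over_sampling import SMOTENC
# imblearn's make_pipeline (not sklearn's) is needed so that sampler steps are accepted
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# category_cols must identify the categorical columns of data_train
model_oversampler_smotenc = make_pipeline(
    SMOTENC(random_state=44, categorical_features=category_cols),
    TomekLinks(sampling_strategy='auto'),
    GradientBoostingClassifier(),
)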
scoring = ['balanced_accuracy', 'f1', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
cv_results_oversampler_smotenc = cross_validate(
    model_oversampler_smotenc, data_train, target_train, scoring=scoring,
    return_train_score=True, return_estimator=True, cv=cv,
    n_jobs=-1,
)
print(
    "Balanced accuracy mean +/- std. dev.: "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].std():.3f}"
)
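For context on why the categorical columns matter in the under-sampling step: a Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour, so finding them depends on a distance between rows. The following is a minimal, library-free sketch of that idea on mixed rows. It is not imblearn's TomekLinks implementation; the `mixed_distance` metric (absolute difference on the numeric field plus a 0/1 mismatch penalty on the category) and the toy data are assumptions made only for illustration.

```python
# Illustrative only: NOT imblearn's TomekLinks implementation.
# Each row is a (numeric_value, category) pair; mixed_distance is a hypothetical metric.

def mixed_distance(a, b):
    """Absolute numeric difference plus 1.0 if the categories differ."""
    return abs(a[0] - b[0]) + (0.0 if a[1] == b[1] else 1.0)

def nearest_neighbor(i, rows):
    """Index of the row closest to rows[i], excluding i itself."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: mixed_distance(rows[i], rows[j]))

def tomek_links(rows, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(rows)):
        j = nearest_neighbor(i, rows)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, rows) == i:
            links.append((i, j))
    return links

# Toy data (made up): class 0 is the majority, class 1 the minority.
rows = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.05, "b"), (5.0, "a")]
labels = [0, 0, 0, 1, 1]

links = tomek_links(rows, labels)
print(links)  # [(2, 3)]
```

With these toy rows the only link is between indices 2 and 3. Under-sampling would then drop the majority-class member of that pair (index 2). If categories are fed to a purely numeric distance as arbitrary integer codes, the distances between rows change, and so can the set of links.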