This project aims to predict customer churn for a bank using machine learning models. The dataset contains information about the bank's customers, with features covering their transactions, demographics, and account activity. The main objective is to build and tune machine learning models that accurately predict whether a customer will churn.
- Large-scale dataset: 355,190 records × 116 features
- Extensive feature selection using correlation, SHAP, and LIME
- Trained Logistic Regression & SVM with hyperparameter tuning (GridSearchCV)
- Deployed with Flask + Gunicorn + Streamlit UI for real-time predictions
- Introduction
- Dataset
- Data Preprocessing
- Feature Engineering
- Model Training and Tuning
- Model Evaluation
- Interpretability
- Deployment
- Usage
- Results
- Conclusion
Customer churn is a critical issue for banks, as retaining existing customers is often more cost-effective than acquiring new ones. This project leverages machine learning to predict which customers are likely to churn based on their historical data and behavior patterns.
- Records: 355,190
- Features: 116
- Target variable: TARGET → 1 (churned), 0 (retained)
- Data includes: Demographics, product usage, account activity, and more
- Handled missing values and duplicates
- One-hot encoded categorical variables
- Normalized numerical columns
- Split into training and test sets
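The steps above could look roughly like the following sketch. It assumes pandas and scikit-learn, median imputation for numeric columns, a "missing" category for categorical ones, and an 80/20 stratified split; none of these specifics are stated in this README, and the tiny `raw` frame with its `age` and `segment` columns is purely illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="TARGET", test_size=0.2, seed=42):
    """Deduplicate, impute, one-hot encode, split, and scale (illustrative only)."""
    df = df.drop_duplicates().reset_index(drop=True)
    y = df[target]
    features = df.drop(columns=[target])

    num_cols = list(features.select_dtypes(include="number").columns)
    cat_cols = [c for c in features.columns if c not in num_cols]

    # Assumed imputation strategy: median for numbers, a "missing" label for categories.
    fill_values = features[num_cols].median().to_dict()
    fill_values.update({c: "missing" for c in cat_cols})
    features = features.fillna(fill_values)

    # One-hot encode the categorical columns.
    X = pd.get_dummies(features, columns=cat_cols, dtype=float)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )

    # Fit the scaler on the training split only, then apply it to both splits.
    X_train, X_test = X_train.copy(), X_test.copy()
    scaler = StandardScaler().fit(X_train[num_cols])
    X_train[num_cols] = scaler.transform(X_train[num_cols])
    X_test[num_cols] = scaler.transform(X_test[num_cols])
    return X_train, X_test, y_train, y_test


# Hypothetical toy data; the last row duplicates the first.
raw = pd.DataFrame({
    "age": [25, 40, None, 33, 51, 29, 62, 45, 38, 57, 25],
    "segment": ["a", "b", "a", None, "b", "a", "b", "a", "b", "a", "a"],
    "TARGET": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
X_train, X_test, y_train, y_test = preprocess(raw)
print(X_train.columns.tolist())
```

Fitting the scaler after the split keeps test-set statistics out of the training data, which is one reasonable way to order the normalization and split steps.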
Significant features were identified through various techniques, including correlation analysis, SHAP, and LIME. The top features selected for the model included:
- REST_AVG_CUR
- LDEAL_ACT_DAYS_PCT_AAVG
- REST_DYNAMIC_IL_3M
- CR_PROD_CNT_IL_5
- CR_PROD_CNT_TOVR_4
- REST_DYNAMIC_CUR_1M
- CR_PROD_CNT_TOVR_5
- CR_PROD_CNT_PIL_4
- TURNOVER_DYNAMIC_IL_3M
- TURNOVER_DYNAMIC_IL_1M
- APP_MARITAL_STATUS_Civil Union
- CR_PROD_CNT_CC_9
- PACK_109
- CR_PROD_CNT_VCU_3
- CR_PROD_CNT_TOVR_6
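As an illustration of the correlation part of this selection, the sketch below ranks features by the absolute Pearson correlation with the target. It is a minimal example on synthetic data, not the project's actual selection code; the function name `rank_by_target_correlation` and the columns `strong`, `weak`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(X, y, top_k=15):
    """Return the top_k features by absolute Pearson correlation with y."""
    corr = X.corrwith(y).abs().dropna()
    return corr.sort_values(ascending=False).head(top_k)


# Synthetic demo: one strongly related feature, one weakly related, one unrelated.
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
X_demo = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),
    "weak": signal + 3.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = pd.Series((signal > 0).astype(int))

ranking = rank_by_target_correlation(X_demo, y_demo, top_k=2)
print(ranking)
```

Correlation only captures linear, one-feature-at-a-time relationships, which is presumably why it was combined with SHAP and LIME here.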
Trained the following models:
- Logistic Regression
- Support Vector Machine (SVM)
Used GridSearchCV for hyperparameter tuning
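A minimal sketch of such a tuning setup with scikit-learn is shown below. The parameter grids, the `roc_auc` scoring choice, the 5-fold cross-validation, and the synthetic data from `make_classification` are all assumptions for illustration; the grids actually used in this project are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One grid search per model; scaling lives inside the pipeline so it is
# refit on each cross-validation fold.
searches = {
    "logreg": GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]),
        param_grid={"clf__C": [0.01, 0.1, 1, 10]},
        scoring="roc_auc",
        cv=5,
    ),
    "svm": GridSearchCV(
        Pipeline([("scale", StandardScaler()), ("clf", SVC())]),
        param_grid={"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
        scoring="roc_auc",
        cv=5,
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the pipeline refit on the full training split with the best parameters.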
Evaluated with:
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
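These metrics can be computed with `sklearn.metrics`, as in the sketch below. The labels and scores are made-up toy values, not project results, and the helper name `evaluate` is hypothetical. ROC-AUC needs a continuous score (a probability or decision value), not the hard 0/1 prediction.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_pred, y_score):
    """Collect the five metrics; y_score is a probability or decision value."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Toy values for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

metrics = evaluate(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```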
Model interpretability was analyzed using LIME, which provided insight into the features driving individual predictions:
- LIME (Local Interpretable Model-agnostic Explanations) was used to explain individual predictions by approximating the model locally.
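To make "approximating the model locally" concrete, the sketch below hand-rolls the core idea: perturb one instance, query the model, weight the perturbed samples by proximity, and fit a weighted linear surrogate. This is a simplified illustration, not the `lime` library and not this project's code; the function `explain_locally`, the Gaussian perturbations, the kernel width, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge


def explain_locally(predict_proba, x, scale, n_samples=2000, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return its coefficients."""
    rng = np.random.default_rng(seed)
    # Perturb x with Gaussian noise, measured in units of each feature's spread.
    offsets = rng.normal(size=(n_samples, x.size))
    neighbours = x + offsets * scale
    churn_proba = predict_proba(neighbours)[:, 1]
    # Closer perturbations get more weight (kernel width is an assumed choice).
    kernel_width = 0.75 * np.sqrt(x.size)
    weights = np.exp(-np.sum(offsets**2, axis=1) / kernel_width**2)
    surrogate = Ridge(alpha=1.0).fit(offsets, churn_proba, sample_weight=weights)
    return surrogate.coef_


# Synthetic data: with shuffle=False, features 0 and 1 are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, shuffle=False, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X, y)

coefs = explain_locally(model.predict_proba, X[0], X.std(axis=0))
for i in np.argsort(-np.abs(coefs)):
    print(f"feature_{i}: {coefs[i]:+.4f}")
```

The sign of each coefficient indicates whether increasing that feature pushes this particular prediction toward or away from churn; the magnitudes are only meaningful near the explained instance.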
- Backend: Flask app running with Gunicorn
- Frontend: Streamlit UI for real-time predictions
- Input form for customer details → instant churn prediction
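A backend of this kind might expose a single prediction endpoint, as in the hedged sketch below. The `/predict` route, the JSON payload shape, the `create_app` factory, the 0.5 threshold, and the `ConstantModel` stand-in are all hypothetical; the project's real endpoint and model loading are not documented here.

```python
from flask import Flask, jsonify, request


def create_app(model, feature_names):
    """Build a Flask app around any object that exposes predict_proba."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [f for f in feature_names if f not in payload]
        if missing:
            return jsonify(error="missing features", missing=missing), 400
        row = [float(payload[f]) for f in feature_names]
        proba = float(model.predict_proba([row])[0][1])
        # The 0.5 decision threshold is an assumption.
        return jsonify(churn_probability=proba, churn=int(proba >= 0.5))

    return app


class ConstantModel:
    """Stand-in for a trained classifier; always returns P(churn) = 0.7."""

    def predict_proba(self, rows):
        return [[0.3, 0.7] for _ in rows]


# Only two of the selected features are used to keep the example short.
app = create_app(ConstantModel(), ["REST_AVG_CUR", "LDEAL_ACT_DAYS_PCT_AAVG"])
```

If this were saved as a file named `app.py` (an assumed name), Gunicorn could serve it with `gunicorn app:app`, and the Streamlit form would POST the customer details to the endpoint as JSON.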
To run the project locally:
- Clone the repository.
- Install the required dependencies.
- Run the Flask app using Gunicorn.
- Access the Streamlit interface to input customer data and view predictions.
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | ~78% | 0.76 | 0.72 | 0.74 | 0.76 |
| SVM | ~80% | 0.77 | 0.79 | 0.78 | 0.80 |
The Support Vector Machine model was selected for deployment because of its higher recall (0.79 vs. 0.72) and ROC-AUC (0.80 vs. 0.76). Higher recall means fewer false negatives, that is, fewer churning customers go undetected.
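As a quick consistency check, the F1 scores in the table follow from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes them from the table values; the helper name `f1_from` is just for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the results table.
reported = {
    "Logistic Regression": (0.76, 0.72, 0.74),
    "SVM": (0.77, 0.79, 0.78),
}

recomputed = {name: round(f1_from(p, r), 2) for name, (p, r, _) in reported.items()}
for name, (_, _, f1_reported) in reported.items():
    print(f"{name}: reported {f1_reported}, recomputed {recomputed[name]}")
```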
This project demonstrates the effectiveness of machine learning in predicting customer churn. By understanding the key features contributing to churn, banks can develop targeted strategies to retain customers and reduce churn rates.