- Overview
- Key Features
- Technologies Used
- Implementation Steps
- Deployment
- Results
- Visual Results
- Contributors
## Overview

This project is a comprehensive machine learning pipeline for Twitter sentiment analysis, built on a recurrent neural network (RNN) with a multi-layer Bidirectional Long Short-Term Memory (LSTM) architecture. The model leverages pre-trained embeddings and stacked LSTM layers to capture complex contextual dependencies in sequential text data, classifying tweets as positive, neutral, or negative.
## Key Features

- Advanced Data Preprocessing: Tokenization, stemming, lemmatization, and stop-word removal techniques are utilized for efficient text normalization.
- Dataset: The training data consisted of over 1.2 million samples, and the test data had approximately 350k samples. The dataset is based on data from the following two sources:
  - University of Michigan Sentiment Analysis competition on Kaggle
  - Twitter Sentiment Corpus by Niek Sanders
- RNN with Bidirectional LSTM Architecture: A multi-layered model that stacks several Bidirectional LSTM layers to capture dependencies in both the forward and backward directions. The final architecture includes stacked LSTM layers, dense layers for deeper representation learning, and dropout for improved generalization.
- Model Evaluation: Precision, recall, and F1-score metrics are used for performance analysis, ensuring the model generalizes well to unseen data.
- Scalable Deployment: The model is deployed on the Hugging Face platform for easy accessibility and integration.
## Technologies Used

- Programming Language: Python (with NumPy and Pandas for data handling)
- Deep Learning Libraries: TensorFlow and Keras for model building and training
- NLP Libraries: NLTK and SpaCy for preprocessing and feature extraction
- Cloud Deployment: Hugging Face
## Implementation Steps

- Data Collection and Preparation: Data sourced from established datasets and processed using Python libraries to handle noisy text data.
- Preprocessing Pipeline:
  - Tokenization using NLTK
  - Lemmatization for uniformity
  - Removal of stop words and special characters
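A minimal sketch of this preprocessing pipeline, assuming NLTK's `word_tokenize`, `WordNetLemmatizer`, and English stop-word list (the project's exact cleaning rules may differ):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time resource downloads (punkt_tab is needed by newer NLTK releases)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet: str) -> list[str]:
    # Drop special characters, keeping only letters and whitespace
    text = re.sub(r"[^a-z\s]", "", tweet.lower())
    # Tokenize with NLTK, then lemmatize and remove stop words
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

print(preprocess("Loving the new update!! #happy"))
# -> ['loving', 'new', 'update', 'happy']
```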
- Model Architecture:
  - Sequential model using an embedding layer initialized with a pre-trained embedding matrix
  - A series of Bidirectional LSTM layers:
    - First layer with 128 units and `return_sequences=True`
    - Second layer with 64 units and `return_sequences=True`
    - Third layer with 32 units and `return_sequences=True`
  - A final LSTM layer with 16 units and `return_sequences=False`
  - Dense layers for non-linear transformations:
    - A dense layer with 64 units and ReLU activation
    - A dense layer with 32 units and ReLU activation
  - Output layer with softmax activation for multi-class classification
  - Dropout layers (20%) for regularization after each LSTM and dense layer
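A sketch of this architecture in Keras (TensorFlow 2.x API assumed); `vocab_size`, `embedding_dim`, `max_len`, and `embedding_matrix` are illustrative placeholders for values produced by the preprocessing step:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

# Illustrative placeholders; in the real pipeline these come from preprocessing
vocab_size, embedding_dim, max_len = 20_000, 100, 64
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # stand-in for pre-trained weights

model = Sequential([
    # Embedding layer initialized with the pre-trained embedding matrix
    Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    # Stacked Bidirectional LSTM layers, each followed by 20% dropout
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.2),
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.2),
    Bidirectional(LSTM(32, return_sequences=True)),
    Dropout(0.2),
    # Final LSTM layer collapses the sequence into a single vector
    LSTM(16, return_sequences=False),
    Dropout(0.2),
    # Dense layers for deeper non-linear representation learning
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(32, activation="relu"),
    Dropout(0.2),
    # Three-way softmax over positive / neutral / negative
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```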
- Training Strategy:
  - Stratified k-fold cross-validation for comprehensive model validation
  - Optimized using the Adam optimizer with a learning-rate scheduler for adaptive learning
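A hedged sketch of this training strategy, combining scikit-learn's `StratifiedKFold` with a plateau-based learning-rate scheduler; `X`, `y`, and `build_model` are assumed stand-ins for the padded sequences, integer labels, and the model definition above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import ReduceLROnPlateau

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Halve the learning rate whenever validation loss plateaus
lr_scheduler = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2)

scores = []
for train_idx, val_idx in skf.split(X, y):
    model = build_model()  # fresh weights for every fold (assumed helper)
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=10, batch_size=256, callbacks=[lr_scheduler])
    scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
print(f"Mean CV accuracy: {np.mean(scores):.3f}")
```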
- Model Evaluation and Hyperparameter Tuning:
  - Hyperparameters fine-tuned using grid search and Bayesian optimization
  - Performance measured through confusion matrices and precision-recall curves
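For illustration, per-class precision, recall, and F1-score plus the confusion matrix can be computed with scikit-learn; `model`, `X_test`, and `y_test` are assumptions carried over from the training step:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Convert softmax probabilities to predicted class indices
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred,
                            target_names=["negative", "neutral", "positive"]))
print(confusion_matrix(y_test, y_pred))
```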
## Deployment

The final model is hosted on Hugging Face for seamless accessibility. The solution is exposed through a REST API endpoint, allowing integration with web applications and data analysis platforms.
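A minimal client sketch for such an endpoint; the URL and response shape below are placeholders, not the project's actual API:

```python
import requests

# Placeholder URL; substitute the actual Hugging Face endpoint for this model
API_URL = "https://<your-endpoint>.huggingface.cloud/predict"
response = requests.post(API_URL, json={"text": "I love this new phone!"})
print(response.json())  # illustrative, e.g. {"label": "positive", "score": 0.97}
```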
## Results

- Precision and Recall: High precision and recall scores for positive and negative classes, highlighting the model's capability in sentiment differentiation.
## Visual Results

Below are visual representations of the model's performance: