This project focuses on Sentiment Analysis of Tweets using the popular Sentiment140 dataset.
The model predicts whether a tweet expresses a positive or negative sentiment by leveraging Natural Language Processing (NLP) techniques and a Logistic Regression classifier.
The pipeline includes:
- Text preprocessing (cleaning, stopword removal, and stemming)
- TF-IDF Vectorization for numerical feature extraction
- Model training using Logistic Regression
- Model evaluation on unseen test data
- Model persistence with `pickle` for future use
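The text preprocessing step listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's exact code: the helper name `clean_tweet`, the regular expressions, and the tiny inline stopword set are assumptions made for the example. The project itself relies on the `nltk` stopword list, which needs a one-time download.

```python
import re

from nltk.stem.porter import PorterStemmer

# Assumption: a tiny illustrative stopword set. The project uses nltk's
# English list (nltk.corpus.stopwords.words("english")) instead.
STOPWORDS = {"i", "am", "a", "an", "the", "is", "this", "to", "and", "at", "my"}

stemmer = PorterStemmer()


def clean_tweet(text: str) -> str:
    """Hypothetical helper: strip noise, drop stopwords, stem what remains."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    text = re.sub(r"[^a-zA-Z]", " ", text)              # keep letters only
    tokens = text.lower().split()
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS)


print(clean_tweet("@bob I am loving the new phone!!! http://t.co/xyz"))
```

URLs are removed before punctuation so that the letters inside a link do not survive as stray tokens.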
- Dataset: Sentiment140
- Size: 1.6 million tweets
- Target Variable:
  - `0` → Negative Sentiment
  - `4` → Positive Sentiment (converted to `1` in this project)
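As a rough sketch of the label conversion described above, the snippet below reads a CSV with `latin-1` encoding and maps the positive label `4` to `1`. The column names, the loader name `load_sentiment140`, and the two sample rows are assumptions for illustration. The public Sentiment140 file has six columns and no header row, but the project's own column names may differ.

```python
import io

import pandas as pd

# Assumption: six-column layout of the public Sentiment140 CSV (no header row).
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def load_sentiment140(source) -> pd.DataFrame:
    """Hypothetical loader: read the CSV and map the positive label 4 to 1."""
    df = pd.read_csv(source, names=COLUMNS, encoding="latin-1")
    df["target"] = df["target"].replace(4, 1)
    return df


# Two made-up rows standing in for the real 1.6-million-row file.
sample = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling sad today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","what a great day"\n'
).encode("latin-1")

df = load_sentiment140(io.BytesIO(sample))
print(df["target"].tolist())
```

In practice `source` would be the path to the downloaded CSV file rather than an in-memory buffer.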
- Languages: Python 3.10
- Libraries:
  - `numpy`, `pandas` → Data handling
  - `nltk` → Stopwords, stemming
  - `scikit-learn` → TF-IDF, train-test split, Logistic Regression
  - `pickle` → Model saving
- Environment: Google Colab
- Data Loading: Load the dataset with the correct encoding (`latin-1`).
- Data Cleaning:
  - Remove unwanted characters, mentions, URLs, and punctuation.
  - Apply stemming using `PorterStemmer`.
- Feature Extraction: Convert text to numerical vectors using TF-IDF.
- Train-Test Split: 80% training, 20% testing (stratified).
- Model Training: Logistic Regression with `max_iter=1000`.
- Evaluation:
  - Training Accuracy: ~80%
  - Test Accuracy: ~78%
  - No significant overfitting detected: the gap between training and test accuracy is under 3 percentage points.
- Model Deployment: Save the model as `trained_model.sav` for reuse.
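The feature extraction, split, and training steps above could look roughly like this on a toy corpus. The example tweets, the variable names, and `random_state=2` are made up for illustration. Fitting the vectorizer on the training split only is a choice of this sketch and may not match the order used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up, already-cleaned tweets; 1 = positive, 0 = negative.
texts = [
    "love great phone", "happi sunni day", "best movi ever", "amaz servic thank",
    "wonder friend visit", "hate slow phone", "sad raini day", "worst movi ever",
    "terribl servic rude", "miss friend lone",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2
)

# Fit TF-IDF on the training text, then reuse its vocabulary for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

print("train accuracy:", model.score(X_train_vec, y_train))
print("test accuracy:", model.score(X_test_vec, y_test))
```

With only ten samples the printed accuracies are not meaningful. The point is the order of operations and the `stratify` argument.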
| Dataset  | Accuracy |
|----------|----------|
| Training | 80.4%    |
| Testing  | 77.7%    |
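Accuracy figures like the ones in the table are the share of predictions that match the true labels. The short sketch below shows that calculation with scikit-learn's `accuracy_score` on made-up labels, then computes the gap between the two reported figures. The label lists are invented and are not the project's data.

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, only to show the calculation.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # correct predictions / all predictions
print(f"accuracy: {accuracy:.3f}")

# Gap between the reported training and testing accuracy, in percentage points.
reported_train, reported_test = 80.4, 77.7
gap = reported_train - reported_test
print(f"train-test gap: {gap:.1f} points")
```

A gap of a few points between training and test accuracy is what the README describes as no significant overfitting.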
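```python
import pickle

# Load the trained model from disk
with open('trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

# Predict sentiment.
# NOTE: `vector` is assumed to be the TfidfVectorizer fitted during training.
# It is not loaded from trained_model.sav here, so it must already be in memory.
tweet = "I love this product! Absolutely amazing."
vectorized_tweet = vector.transform([tweet])
prediction = loaded_model.predict(vectorized_tweet)
print("Positive Tweet" if prediction[0] == 1 else "Negative Tweet")
```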
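The prediction snippet above relies on a `vector` object that only exists in the training session. One possible way to make inference work in a fresh session is to pickle the fitted vectorizer together with the model, as sketched below. The file name `sentiment_bundle.sav`, the dictionary keys, and the toy training data are hypothetical and not part of the project.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set; 1 = positive, 0 = negative.
texts = ["love great phone", "best day ever", "hate slow phone", "worst day ever"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

# Hypothetical file name; both objects go into one pickle so they stay in sync.
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.sav")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "model": model}, f)

# Later, in a fresh session: restore both objects and predict.
with open(path, "rb") as f:
    bundle = pickle.load(f)

features = bundle["vectorizer"].transform(["love great day"])
prediction = bundle["model"].predict(features)
print("Positive Tweet" if prediction[0] == 1 else "Negative Tweet")
```

New tweets should go through the same cleaning and stemming used at training time before they are vectorized. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.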