# Homework 4 for the course Advanced Data Mining and Language Technologies at La Sapienza University of Rome
The assignment consists of analyzing customer ratings and comments for a set of products and building a language model that classifies a customer's comment as negative or positive.
From this dataset we quantize the four possible ratings into a binary label (positive or negative comment), which we use as the target for supervised models, and we treat `title` and `review_text` as the only informative features for classification.
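A minimal sketch of the label construction described above. The exact rating scale and threshold are not stated in this README, so the 1–4 scale and the "rating ≥ 3 is positive" cutoff used here are assumptions for illustration:

```python
# Map a raw star rating to a binary sentiment label.
# Assumption: ratings are integers 1-4 and >= 3 counts as positive.
def binarize_rating(rating: int) -> int:
    """Return 1 for a positive review, 0 for a negative one."""
    return 1 if rating >= 3 else 0

# Build (text, label) pairs using title + review_text as the features.
records = [
    {"title": "Great", "review_text": "Loved it", "rating": 4},
    {"title": "Awful", "review_text": "Broke in a day", "rating": 1},
]
labeled = [
    (r["title"] + " " + r["review_text"], binarize_rating(r["rating"]))
    for r in records
]
```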
Our preferred model is a neural network that uses a pre-trained BERT model for the embedding step, fine-tuned with a simple feedforward neural network (FNN) on top.
In the first part of the homework we try different combinations of encoding techniques and machine-learning models and compare them. See the notebook for further details about our choices.
| TF-IDF + Complement Naive Bayes | Word2Vec + RandomForest | BERT + XGBoost |
|---|---|---|
| ![]() | ![]() | ![]() |
XGBoost in combination with BERT embeddings appears to slightly outperform the other two methods.
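To make the encoding step concrete, here is a hand-rolled TF-IDF sketch. It is illustrative only; the homework presumably relies on a library implementation such as scikit-learn's `TfidfVectorizer`:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: tf-idf weight} dict per document."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

vecs = tfidf(["good product", "bad product"])
# "product" occurs in every document, so its idf (and weight) is 0,
# while the discriminative words "good" and "bad" get positive weight.
```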
In the second part we present our final model, which improves on the best model from the previous study through transfer learning: we combine BERT embeddings with a simple FNN classifier.
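The forward pass of such a classifier head can be sketched as follows. The weights, sizes, and the 3-dimensional "embedding" below are toy values chosen for illustration; in the homework the input would be a 768-dimensional BERT sentence embedding:

```python
import math

def ffnn_forward(x, w1, b1, w2, b2):
    """One ReLU hidden layer followed by a sigmoid output in [0, 1]."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # P(positive review)

# Toy example: 3-dim "embedding", 2 hidden units (all values assumed).
x = [0.5, -1.0, 0.25]
w1 = [[0.1, 0.2, 0.3], [-0.2, 0.1, 0.4]]
b1 = [0.2, 0.1]
w2 = [0.5, -0.5]
b2 = 0.0
p = ffnn_forward(x, w1, b1, w2, b2)  # probability the review is positive
```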
The final model achieves excellent performance compared to the previous ones, demonstrating the relevance of deep learning models in language processing (even though our study relies on simpler models than RNNs or end-to-end fine-tuned transformers).
| Evaluation metrics |
|---|
| ![]() |
| Performance |
|---|
| ![]() |
