155 changes: 155 additions & 0 deletions examples/nlp/imdb_lstm_sentiment.py
@@ -0,0 +1,155 @@
"""
Title: Sentiment analysis with LSTM on the IMDB dataset
Author: Madhur Jain
Date created: 2025/11/19
Last modified: 2025/11/19
Description: A simple LSTM-based sentiment classifier trained on IMDB text reviews.
"""

"""
## Introduction

LSTM refers to Long short term memories, that is while predicting it not only keeps short term memory but also long term memory
LSTM uses sigmoid activation functions and tan-h activation functions:
The Sigmoid fn. ranges the values from 0 to 1,
tan-h function ranges the values from -1 to 1.
Doesn't let Gradient Descent of long term memory to vanish or short term memory to completely explode.
It contains 3 stages:
1st stage: Determines what % of long term memory is remembered- c/a Forget Gate
2nd stage: Determines how we would update long-term memory- c/a Input Gate
3rd stage: Updates short term memory and it is the output of the entire stage- c/a Output Gate

If you wanna know more deeply about it, I would recommend to watch Stanford Online: statistacl analysis with Python course lectures available on Youtube (for free)

"""
Comment on lines +9 to +24
medium

The introduction docstring is informal (e.g., "wanna", "c/a"), contains a typo ("statistacl"), and its explanation of LSTMs is a bit convoluted for a beginner's example. Keras example introductions should be concise and clearly state the example's purpose. Please rewrite this to be more formal and clear, similar to other Keras-IO tutorials.

"""
## Introduction

This example demonstrates how to perform binary sentiment analysis on the IMDB movie review dataset using a simple `Sequential` model with an `Embedding` layer and an `LSTM` layer.

The workflow is as follows:
1. Load the IMDB dataset.
2. Preprocess and tokenize the text reviews.
3. Build a `Sequential` model with `Embedding`, `LSTM`, and `Dense` layers.
4. Train the model.
5. Evaluate the model's performance.
6. Use the trained model to predict the sentiment of new, unseen reviews.
"""


import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
from keras import layers
from keras.models import Sequential
Comment on lines +29 to +32
critical

The script will fail to run because several modules are used without being imported. Please add the necessary imports at the top of the file. Note that some of my other suggestions might change which imports are needed.

import keras
import tensorflow as tf
from keras import layers
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
import json
from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


"""
## Load the dataset
Get kaggle.json from your Kaggle account -> Settings -> Create New Token.
"""
kaggle_dictionary = json.load(open("kaggle.json"))  # parse the JSON credentials into a Python dictionary
# Set the Kaggle credentials as environment variables
kaggle_dictionary.keys()

os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]

# unzip the dataset file
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
    zip_ref.extractall()

# load the dataset
data = pd.read_csv("/content/IMDB Dataset.csv")
Comment on lines +38 to +50

critical

The current data loading process is not suitable for a Keras example. It relies on a local kaggle.json file and has a hardcoded path (/content/IMDB Dataset.csv) that will only work in a specific Google Colab environment. Keras examples should be self-contained and runnable by any user without manual file setup.

Please refactor this to automatically download the dataset. A standard approach is to use keras.utils.get_file to fetch the data from a public URL. For the IMDB dataset, you can either use the pre-tokenized version from keras.datasets.imdb.load_data() or download the raw text from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz and process it, which would align better with showing a full text-vectorization pipeline.
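For reference, a minimal sketch of the `keras.datasets` route (the `num_words` and `maxlen` values here simply mirror the 5000/200 settings used later in this script):

import keras
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

# Load the pre-tokenized IMDB reviews, keeping only the 5000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=5000)

# Pad/truncate every review to a fixed length of 200 tokens
x_train = pad_sequences(x_train, maxlen=200)
x_test = pad_sequences(x_test, maxlen=200)

For the raw-text route, `keras.utils.get_file` with `untar=True` can download and extract the aclImdb archive in one call.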


data.shape

data.info()

data.head()

data["sentiment"].value_counts()

data.replace({"sentiment": {"positive": 1, "negative": 0}}, inplace=True)

data.head()

data["sentiment"].value_counts()


"""
## Splitting into training and test sets
"""
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

print(train_data.shape)
print(test_data.shape)


"""
## Data Processing
"""
# Tokenize the text data: map each word to an integer index,
# then pad every sequence to a fixed length of 200 tokens
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)
Comment on lines +81 to +84

high

The PR description mentions using TextVectorization, but the code uses the legacy keras.preprocessing.text.Tokenizer. The TextVectorization layer is the recommended approach in modern Keras. It can be included directly in your model, which makes the model end-to-end and simplifies inference, as it can process raw strings directly. Using it would also align the example with the PR description and other modern NLP examples in Keras.

You could replace the Tokenizer and pad_sequences logic with a TextVectorization layer that you adapt() on the training text.
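Roughly along these lines (a sketch, not a drop-in diff; the vocabulary size and sequence length mirror the current `Tokenizer`/`pad_sequences` settings):

from keras.layers import TextVectorization

# Map raw review strings to padded integer sequences
vectorize_layer = TextVectorization(
    max_tokens=5000,
    output_mode="int",
    output_sequence_length=200,
)
# Build the vocabulary from the training reviews only
vectorize_layer.adapt(train_data["review"].to_numpy())

X_train = vectorize_layer(train_data["review"].to_numpy())
X_test = vectorize_layer(test_data["review"].to_numpy())

Placing the layer inside the model instead would let `predict_sentiment` pass raw strings directly to `model.predict`.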


print(X_train)
print(X_test)

Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]

print(Y_train)
print(Y_test)


"""
## LSTM (Long Short-Term Memory) Model
"""
# Build the model
model = Sequential()
# Add the embedding, LSTM, and dense output layers
model.add(Embedding(input_dim=5000, output_dim=128, input_shape=(200,)))

medium

The input_shape argument is not necessary for the Embedding layer in this Sequential model. Keras can infer the input shape automatically. Removing it makes the code cleaner and more robust to changes in sequence length.

model.add(Embedding(input_dim=5000, output_dim=128))

model.add(LSTM(128, dropout=0.2))
model.add(Dense(1, activation="sigmoid"))

model.summary()

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
"""
## Training the Model
"""
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

"""
## Model Evaluation
"""
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

"""
### Predicting Values
"""
def predict_sentiment(review):
    # tokenize and pad the review
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)
    prediction = model.predict(padded_sequence)
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    return sentiment


# examples

new_review = "This movie was fantastic. I loved it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")
#===================================================================================#
new_review = "This movie was not that good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")
#===================================================================================#
new_review = "Great movie but could have added a better action scene"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")
#===================================================================================#
new_review = "Mid movie"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")
# ==================================================================================#
new_review = "I laughing while shitting damn what a watch"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")
Comment on lines +135 to +155

high

The example reviews and comments in this section are not suitable for an official Keras example. The separator comments (#=====...) are noisy, and one of the reviews contains profanity. Please replace these with more professional examples and remove the unnecessary comment separators.

#  examples

new_review = "This movie was fantastic. I loved it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

new_review = "This movie was not that good"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

new_review = "Great movie but could have added a better action scene"
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

new_review = "It was a mediocre film, I would not recommend it."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")