The notebook `analysis/explore.ipynb` performs an in-depth analysis of user interactions with an AI chatbot. The analysis includes generating concise summaries of the general intentions and key themes of the questions users ask, exploratory analysis of the messages (e.g. word clouds, language distributions, and user engagement statistics), and a focused study of negative chatbot responses.

`analysis/explore_with_results.html` contains an exported version of the notebook, including the generated plots and results.
Here is an outline of the different parts of the notebook.
- Load usage data from an Excel file.
- Requests are in serialized Python dict format (not JSON). We:
  - Attempt to parse with `ast.literal_eval`.
  - Fall back to repairing the string with `json_repair` when needed.
- Extract `history` from each request and isolate only user messages.
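The parsing fallback above can be sketched as follows. `parse_request` is a hypothetical helper name, and the `json_repair` import is deferred so the common `ast.literal_eval` path works without it:

```python
import ast

def parse_request(raw: str) -> dict:
    """Parse a request serialized as a Python dict literal, repairing it if needed."""
    try:
        # Most requests are valid Python literals (single quotes, True/False/None).
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Fall back to json_repair for malformed or truncated strings.
        from json_repair import loads as repair_loads
        return repair_loads(raw)

# Isolate only the user messages from the request's history.
request = parse_request("{'history': [{'role': 'user', 'content': 'hi'}]}")
user_messages = [t["content"] for t in request["history"] if t["role"] == "user"]
```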
We use `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B` and a local vLLM server to generate insights from user messages.
- Tokenize user messages with a HuggingFace tokenizer to estimate the total number of tokens.
- Chunk messages into manageable sizes (≤50k tokens).
- Construct prompts and generate summaries per chunk.
- Save each chunk summary to `summary_results.json`.
- Combine all summaries to generate a final, global summary of user intents and topics, saved in `global_summary.txt`.
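The chunking step might look like this minimal sketch; the whitespace-based `count_tokens` default is a stand-in for the HuggingFace tokenizer the notebook actually uses (e.g. `len(tokenizer.encode(text))`):

```python
def chunk_messages(messages, max_tokens=50_000, count_tokens=lambda s: len(s.split())):
    """Greedily pack messages into chunks of at most max_tokens tokens each."""
    chunks, current, current_tokens = [], [], 0
    for msg in messages:
        n = count_tokens(msg)
        # Flush the current chunk if adding this message would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(msg)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks

# Tiny budget for illustration only.
chunks = chunk_messages(["a b c", "d e", "f"], max_tokens=3)
```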
- Compute TF-IDF scores of the top terms.
- Visualize with `wordcloud`.
- Detect the languages of user messages using `langdetect`.
- Plot the frequency of detected languages.
- Use `BERTopic` to extract topics from user messages.
- Visualize the top 10 topics (currently experimental; may need more tweaking).
- Plot number of messages by hour of the day with 6-hour bins.
- Plot number of messages by day of the week.
- Plot number of messages per chat session.
- Bucket users by number of messages sent.
- Display distribution via pie chart.
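The user-bucketing step could be sketched with pandas; the bucket edges and the toy message log below are illustrative, not the ones the notebook uses:

```python
import pandas as pd

# Toy message log: one row per message (user_id comes from the usage data).
df = pd.DataFrame({"user_id": ["u1"] * 12 + ["u2"] * 3 + ["u3"]})

# Messages sent per user, bucketed into ranges.
counts = df["user_id"].value_counts()
buckets = pd.cut(
    counts,
    bins=[0, 1, 5, 10, float("inf")],
    labels=["1", "2-5", "6-10", "11+"],
)
distribution = buckets.value_counts().sort_index()
# distribution can then be plotted with distribution.plot.pie()
```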
Analyze how the chatbot handles negative responses and how users react:
- The current process is based on lexical rules; it could be made more intelligent with LLMs if needed.
- Check whether the bot says "I could not find" or "unable to find".
- Detect whether suggestions (`<<...>>`) are offered.
- Analyze follow-up behavior:
  - The user follows a suggestion
  - The user asks something else
  - The bot fails to answer even when the user follows one of the suggestions (ideally this should not happen, but in practice it does 66% of the time)
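The lexical rules above can be sketched as follows; the phrases and the `<<...>>` suggestion markers come from the description, while the classification of follow-up behavior would additionally compare the next user message against the extracted suggestions:

```python
import re

NEGATIVE_PHRASES = ("i could not find", "unable to find")
SUGGESTION_RE = re.compile(r"<<(.*?)>>")

def is_negative(bot_reply: str) -> bool:
    """Lexical rule: the reply admits it found nothing."""
    text = bot_reply.lower()
    return any(phrase in text for phrase in NEGATIVE_PHRASES)

def extract_suggestions(bot_reply: str) -> list[str]:
    """Suggestions are offered between << and >> markers."""
    return SUGGESTION_RE.findall(bot_reply)

reply = "I could not find that page. Try: <<Pricing>> <<Refund policy>>"
```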
- Histogram: Negative response count per session
- Sankey Diagram: Flow from negative response to user action
- Copy `Usage_Data.xlsx` to the `data/` directory.
- Install the required dependencies: `pip install -r requirements.txt`
To run the summarization section, you also need:
- A local vLLM instance running an LLM (feel free to change it to any model supported by vLLM):

  ```shell
  VLLM_USE_V1=0 vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --max-log-len 10 --max-model-len 100000 --enable-reasoning --reasoning-parser deepseek_r1
  ```
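vLLM exposes an OpenAI-compatible API (by default on `http://localhost:8000/v1`), which is what the `utils.py` wrapper talks to. This sketch only builds a chat-completions request body without sending it; the system prompt is a made-up example, not the one the notebook uses:

```python
import json

def build_chat_request(user_content: str,
                       model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B") -> str:
    """Build a JSON body for POST http://localhost:8000/v1/chat/completions."""
    return json.dumps({
        "model": model,
        "messages": [
            # Hypothetical system prompt for the summarization step.
            {"role": "system", "content": "Summarize the users' intents and topics."},
            {"role": "user", "content": user_content},
        ],
        "temperature": 0.6,
    })

body = json.loads(build_chat_request("msg1\nmsg2"))
```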
```
.
├── requirements.txt               # Python dependencies for running the project
├── data/
│   ├── global_summary.txt         # LLM-generated global summary of all user messages
│   ├── Usage_Data.xlsx            # Main dataset used for analysis (not included in the repo)
│   └── summary_results.json       # LLM-generated summaries of chunks of user messages
├── analysis/
│   ├── utils.py                   # OpenAI wrapper for vLLM
│   ├── explore.ipynb              # Main exploratory data analysis notebook
│   └── explore_with_results.html  # Static HTML export of the results notebook
├── README.md                      # This project overview and documentation
└── LICENSE                        # License for usage and distribution
```