Project Version: 1.0.0
Last Updated: May 19, 2025
Author/Team: Eda AYDIN / NeuroQuantix
LLM/NLP Analysis of Cognitive States using ZuCo Dataset
This project explores the application of Natural Language Processing (NLP) and Large Language Model (LLM) techniques to differentiate between cognitive states, specifically Normal Reading (NR) and Task-Specific Reading (TSR). The analysis is performed on a subset of the ZuCo dataset, focusing on linguistic characteristics of sentences read under these two conditions. The project progresses from baseline machine learning models with traditional NLP features to advanced techniques involving fine-tuned transformer embeddings and more complex neural network architectures.
The primary objective is to investigate and identify linguistic features and model architectures that can accurately classify sentences based on the cognitive state (NR or TSR) associated with their reading. This involves comparing various feature engineering strategies and machine learning models to determine the most effective approach for this classification task.
Understanding the linguistic markers of different cognitive states holds significant value for cognitive science by providing insights into how language processing varies with cognitive load and task demands. In NLP, this research can contribute to developing more context-aware language understanding systems. Potential applications include adaptive educational tools, diagnostic aids for reading comprehension, and enhanced human-computer interaction.
The project utilizes sentence data from the ZuCo dataset, specifically from CSV files corresponding to Normal Reading (NR) and Task-Specific Reading (TSR) tasks. Sentences were extracted from files named nr_*.csv and tsr_*.csv located in the task_materials/ directory.
- Loading: Sentences from NR and TSR CSV files were loaded.
- Cleaning: Text was converted to lowercase, and extra whitespace was removed.
- Uniqueness & Overlap Handling: Unique sentences for each condition were identified. Sentences common to both NR and TSR unique lists (61 sentences) were removed to ensure distinct datasets for classification.
- Label Encoding: NR was encoded as 0, and TSR as 1.
- Final Dataset: The processed dataset (zuco_processed_sentences.csv) contains 635 unique sentences (304 NR, 331 TSR).
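The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the function names `clean` and `build_dataset` are hypothetical, and the sentences are assumed to be already loaded from the CSV files into lists of strings.

```python
import re

NR_LABEL, TSR_LABEL = 0, 1  # label encoding described above


def clean(sentence: str) -> str:
    """Lowercase and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", sentence.lower()).strip()


def build_dataset(nr_sentences, tsr_sentences):
    """Return (sentence, label) rows plus the set of removed overlap sentences."""
    nr_unique = {clean(s) for s in nr_sentences}
    tsr_unique = {clean(s) for s in tsr_sentences}
    overlap = nr_unique & tsr_unique  # sentences read under both conditions
    rows = [(s, NR_LABEL) for s in sorted(nr_unique - overlap)]
    rows += [(s, TSR_LABEL) for s in sorted(tsr_unique - overlap)]
    return rows, overlap


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    nr = ["He was born in  Paris.", "She won the award.", "he was born in paris."]
    tsr = ["She won the award.", "He founded the company."]
    rows, overlap = build_dataset(nr, tsr)
    print(rows)
    print(overlap)
```

Removing the overlap before label encoding means no sentence can appear with both labels, which is what makes the two classes separable in principle.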
Three main feature sets were developed, plus a combined representation used by the final model:
- Sentence Embeddings (Off-the-shelf): The all-MiniLM-L6-v2 model was used to generate 384-dimensional embeddings (sentence_embeddings.npy).
- Discrete Linguistic Features:
  - Base Discrete: Character/word counts, average word length, Type-Token Ratio (TTR), lexical density proxy, and readability scores (Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog) (base_discrete_features.csv).
  - Enhanced Discrete: Base features augmented with spaCy-derived syntactic features (clause counts, dependency distances, POS tag counts) (enhanced_discrete_features.csv).
  - (Experimental) LLM Metrics: An ollama_llm_rating (1-5 complexity) was added (simulated in the primary notebook run).
- Fine-tuned Transformer Embeddings: A bert-base-uncased model was fine-tuned on the NR/TSR classification task using 5-fold cross-validation. The best fold's model was used to extract 768-dimensional embeddings.
- Combined Features: Fine-tuned BERT embeddings were concatenated with scaled enhanced discrete features, resulting in 787-dimensional vectors for the advanced MLP model.
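A few of the base discrete features can be computed without any third-party library. The sketch below is illustrative only: the function name `base_discrete_features`, the dictionary keys, and the regex tokenizer are assumptions (the notebook may tokenize differently, for example with NLTK), and the readability scores and lexical density proxy are left out.

```python
import re


def base_discrete_features(sentence: str) -> dict:
    """Compute simple length and lexical-diversity features for one sentence."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    n_words = len(words)
    return {
        "char_count": len(sentence),
        "word_count": n_words,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        # Type-Token Ratio: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(base_discrete_features("The cat saw the dog"))
```

Because TTR is computed per sentence, it is close to 1.0 for most short sentences and only drops when a word repeats, so it tends to carry more signal on longer sentences.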
- Baseline Models: Logistic Regression, Decision Tree, Random Forest, LightGBM, SVM, and a scikit-learn MLP were evaluated on off-the-shelf embeddings and discrete feature sets. Data was split 80/20 (train/test), with discrete features scaled using StandardScaler where appropriate.
- Fine-tuned BERT: Trained using 5-fold StratifiedKFold CV with early stopping.
- Advanced MLP (PyTorch): A custom MLP (Input -> 512 -> 128 -> 2) with ReLU, BatchNorm, and Dropout was trained on the combined fine-tuned embeddings and scaled enhanced discrete features.
- Evaluation Metrics: The primary metric was the F1-score for the TSR class. Accuracy, macro F1, and confusion matrices were also used.
The strategy focused on improving representation learning via fine-tuned transformer embeddings and enhancing model architecture by combining these with engineered features in an advanced MLP.
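The combination step can be illustrated with a dependency-free sketch: discrete features are standardized with statistics fitted on the training rows only, then appended to the embedding. The helper names `fit_scaler`, `scale`, and `combine` are hypothetical; the notebook uses scikit-learn's StandardScaler, whose behavior (population standard deviation, constant columns left unscaled) is mirrored here. The count of 19 discrete features in the demo is inferred from 787 minus 768 and is not stated explicitly in this README.

```python
from statistics import fmean, pstdev


def fit_scaler(train_rows):
    """Per-column mean and population std, computed on training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; divide by 1 instead to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def scale(row, means, stds):
    """Standardize one feature row with previously fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def combine(embedding, discrete_row, means, stds):
    """Concatenate an embedding with the scaled discrete features."""
    return list(embedding) + scale(discrete_row, means, stds)


if __name__ == "__main__":
    # Placeholder values: a 768-d embedding and 19 discrete features (assumed count).
    means, stds = fit_scaler([[0.0] * 19, [2.0] * 19])
    vector = combine([0.0] * 768, [1.0] * 19, means, stds)
    print(len(vector))
```

Scaling only the discrete block keeps large-magnitude counts (for example character counts) from dominating the embedding dimensions in the first linear layer of the MLP.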
- Baseline Models: Random Forest on enhanced discrete features initially performed best (F1-TSR: 0.7445).
- Fine-tuned BERT (CV): Achieved a mean F1-TSR of 0.7434, with the best fold reaching 0.7939.
- Advanced MLP (Test Set): The PyTorch MLP trained on combined fine-tuned BERT embeddings and scaled enhanced discrete features yielded the highest performance:
  - F1-score (TSR class): 0.9474
  - Accuracy: 0.9449
  - F1-score (Macro): 0.9448
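For readers who want to see how the three reported metrics relate to one another, the sketch below derives them from a binary confusion matrix with TSR as the positive class. The counts are hypothetical: they were chosen so that a 127-sentence test set (20% of 635) reproduces the reported figures, and the notebook's actual confusion matrix, in particular the split between false positives and false negatives, may differ.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_report(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, TSR-class F1, and macro F1, with TSR (label 1) as positive."""
    f1_tsr = f1(tp, fp, fn)
    # With NR as the positive class, the roles of the error counts swap.
    f1_nr = f1(tn, fn, fp)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "f1_tsr": f1_tsr,
        "f1_macro": (f1_tsr + f1_nr) / 2,
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the notebook.
    report = binary_report(tn=57, fp=4, fn=3, tp=63)
    print({name: round(value, 4) for name, value in report.items()})
```

With only 127 test sentences, each additional misclassification moves accuracy by roughly 0.8 percentage points, which is worth keeping in mind when comparing models on this split.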
This project was developed using Python 3.12. Key libraries are listed in llm_analysis2.ipynb and include:
pandas
numpy
glob
os
re
time
matplotlib
seaborn
nltk
scikit-learn
sentence-transformers
textstat
spacy
torch
transformers
datasets
lightgbm
# ollama (for experimental LLM metrics)
It's recommended to set up a virtual environment:
python -m venv zuco_env
source zuco_env/bin/activate # On Windows: zuco_env\Scripts\activate
pip install -r requirements.txt # Assuming you create a requirements.txt file
Download necessary NLTK and spaCy resources:
import nltk
import spacy
nltk.download('punkt', quiet=True)
# If en_core_web_sm is not present:
# python -m spacy download en_core_web_sm
For the experimental Ollama features, ensure the Ollama server is running and the specified model (e.g., llama3.2) is pulled.
.
├── llm_analysis2.ipynb # Main Jupyter Notebook with the analysis
├── task_materials/ # Directory containing the input CSV files (nr_*.csv, tsr_*.csv)
│ ├── nr_*.csv
│ └── tsr_*.csv
├── zuco_processed_sentences.csv # Output: Processed sentences and labels
├── sentence_embeddings.npy # Output: Off-the-shelf sentence embeddings
├── base_discrete_features.csv # Output: Base discrete linguistic features
├── enhanced_discrete_features.csv # Output: Enhanced discrete features (including spaCy and simulated Ollama)
├── model_performance_summary.csv # Output: Summary of baseline model performances
├── final_model_performance_summary.csv # Output: Summary including advanced MLP performance
├── best_mlp_combined_features_ZuCo.bin # Output: Saved weights for the best PyTorch MLP model
├── sentence_length_distributions.png # Output: EDA plot
├── sentence_length_boxplots.png # Output: EDA plot
└── README.md # This file
- Ensure all dependencies are installed (see Setup and Installation).
- Place the ZuCo dataset CSV files (e.g., nr_S1.csv, tsr_S1.csv) into the task_materials/ directory.
- Run the llm_analysis2.ipynb notebook cell by cell.
  - Note: The BERT fine-tuning (Section 5.1.3) and actual Ollama calls (Section 3.3.3, if OLLAMA_ENABLED is set to True and the simulation block is removed) can be very time-consuming. The notebook is currently set to use simulated Ollama ratings for speed.
- Outputs, including processed data, features, model performance summaries, and saved models/plots, will be generated in the root directory.
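Before running the notebook, it can help to confirm that the input files are where it expects them. The following preflight check is a small sketch, not part of the notebook; the function name `find_inputs` is hypothetical, and it only assumes the nr_*.csv and tsr_*.csv naming pattern and the task_materials/ directory described above.

```python
from pathlib import Path


def find_inputs(root="task_materials"):
    """Return the NR and TSR CSV paths under root, or raise if either is missing."""
    base = Path(root)
    found = {
        "nr": sorted(base.glob("nr_*.csv")),
        "tsr": sorted(base.glob("tsr_*.csv")),
    }
    missing = [name for name, files in found.items() if not files]
    if missing:
        raise FileNotFoundError(
            f"No {'/'.join(missing)} CSV files found under {base}"
        )
    return found


if __name__ == "__main__":
    try:
        inputs = find_inputs()
        print({name: len(files) for name, files in inputs.items()})
    except FileNotFoundError as err:
        print(err)
```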
- Computational Resources: Fine-tuning transformers and LLM experiments are resource-intensive.
- Feature Engineering: Extracting comprehensive discrete features is complex.
- Data Leakage Potential: The fine-tuned BERT embeddings for the final MLP were derived from a BERT model fine-tuned via CV on the entire dataset. This means the final MLP test set was indirectly "seen" by the BERT model, potentially inflating metrics. A stricter hold-out set from the very beginning is recommended for future iterations.
- Hyperparameter Tuning: Exhaustive tuning was not the primary focus.
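One way to address the leakage concern is to set aside a stratified test set before any fine-tuning and to run BERT cross-validation, scaler fitting, and MLP training on the remaining sentences only. The sketch below shows the split itself in plain Python; the function name `stratified_holdout`, the 20% fraction, and the seed are assumptions for illustration, and scikit-learn's train_test_split with its stratify argument would serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Split indices into (development, test), preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    dev, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        dev.extend(indices[n_test:])
    return sorted(dev), sorted(test)


if __name__ == "__main__":
    # Class sizes from the processed dataset: 304 NR (0) and 331 TSR (1).
    labels = [0] * 304 + [1] * 331
    dev_idx, test_idx = stratified_holdout(labels)
    # Fine-tune BERT (with CV) and train the MLP on dev_idx only;
    # score test_idx once, at the very end.
    print(len(dev_idx), len(test_idx))
```

Under this scheme the test sentences never contribute gradients to the BERT encoder, so the embeddings fed to the MLP at evaluation time come from a model that has not seen their labels.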
- Implement a strict, initially separated test set for final evaluation.
- Conduct advanced hyperparameter optimization for BERT fine-tuning and the MLP.
- Explore alternative transformer architectures (e.g., RoBERTa, DeBERTa).
- Investigate sophisticated feature combination strategies (e.g., attention mechanisms).
- Perform full integration and evaluation of actual Ollama-based LLM complexity metrics.
- Conduct a deeper statistical analysis of the most discriminative linguistic features.
- Eda AYDIN. (2025). LLM/NLP Analysis of Cognitive States using ZuCo Dataset (Unpublished Jupyter Notebook llm_analysis2.ipynb).
- Hollenstein, N., Troendle, M., Langer, N., & Zhang, C. (2021). ZuCo 2.0: A Dataset of Physiological Recordings During Natural Reading and Task-Specific Reading. OSF. Retrieved from https://osf.io/2urht/
- Yao, S., Zhao, J., Yu, D., Liu, Z., & Sha, L. (2022). Exploring the Relationship Between Eye Movements and Cognitive Workload in Natural Reading. Frontiers in Psychology, 13, 1028824. https://doi.org/10.3389/fpsyg.2022.1028824
- Hugging Face Transformers. (2025). Retrieved from https://huggingface.co/transformers
- Sentence Transformers. (2025). Retrieved from https://www.sbert.net
- spaCy. (2025). Retrieved from https://spacy.io
- PyTorch. (2025). Retrieved from https://pytorch.org
- Scikit-learn. (2025). Retrieved from https://scikit-learn.org
- Ollama. (2025). Retrieved from https://ollama.com
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.