Implementation of the paper "Mind the Gap: From Plausible to Valid Self-Explanations in Large Language Models"
Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE) - extractive and counterfactual - using state-of-the-art LLMs (1B to 70B parameters) on three different classification tasks (both objective and subjective). In line with the literature, our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. are plausible), they do not fully and accurately follow the model's decision process (i.e. are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM's prediction. Our results suggest that extractive SE provide the LLM's "guess" at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM's reasoning: We show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature - followed by a validity check - and introducing a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations that offer a 16 times faster alternative to SHAP on average.
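For illustration only, below is a minimal sketch of the sample-and-check idea described in the abstract. The generate_counterfactual and classify callables are hypothetical placeholders, not functions from this repository, and the closing comment is only an informal reading of the "number of tries" estimate, not the paper's exact formula.

    from typing import Callable, Optional

    def sample_valid_counterfactual(
        text: str,
        original_label: str,
        generate_counterfactual: Callable[[str, float], str],  # (text, temperature) -> candidate edit
        classify: Callable[[str], str],                         # text -> predicted label
        temperature: float = 1.0,
        max_tries: int = 10,
    ) -> Optional[str]:
        """Draw counterfactual candidates at high temperature until one is valid,
        i.e. until the model's prediction actually changes."""
        for _ in range(max_tries):
            candidate = generate_counterfactual(text, temperature)
            if classify(candidate) != original_label:  # validity check
                return candidate
        return None  # no valid counterfactual found within the budget

    # If a single candidate flips the prediction with probability p and candidates are
    # drawn independently, the number of tries until the first valid one is geometrically
    # distributed, so roughly 1/p tries are needed on average.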
After cloning or downloading this repository, first run the Linux shell script ./setup.sh. It will initialize the workspace by performing the following steps:
- It will install the required Python modules by running pip install -r "./requirements.txt".
- It will download the necessary Python code to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
- It will download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
- It will download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
- It will download and preprocess the Toxic Spans Dataset by Pavlopoulos et al. (2022) to "./data/toxic spans/".
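As an optional sanity check after setup (not part of setup.sh), the following minimal Python snippet verifies that the resources and data directories listed above were created:

    from pathlib import Path

    # Resources and datasets created by ./setup.sh, as listed above.
    expected = [
        Path("./resources/bart_score.py"),
        Path("./data/food incidents - hazard"),
        Path("./data/movies"),
        Path("./data/toxic spans"),
    ]

    missing = [p for p in expected if not p.exists()]
    if missing:
        print("Setup incomplete; missing:", ", ".join(str(p) for p in missing))
    else:
        print("Workspace initialized: all expected resources are present.")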
When preprocessing is finished, the experiments can be rerun using the shell script ./run.sh, which will run each of the following Python files with all tested LLMs in turn:
python ./ablation-probability-hazard.py
python ./ablation-saliency-hazard.py
python ./self-explanations-hazard.py
python ./self-explanations-movies.py
python ./self-explanations-toxic.py
python ./self-counterfactuals-hazard.py
python ./self-counterfactuals-movies.py
python ./self-counterfactuals-toxic.py
Originally, the experiments were performed using Python 3.10.12 on 8 NVIDIA RTX A5500 graphics cards with 24GB of memory each.
Finally, the Jupyter notebooks evaluate-hazard.ipynb, evaluate-movies.ipynb, etc. can be used to analyze the results.