Implementation of the paper "Mind the Gap: From Plausible to Valid Self-Explanations in Large Language Models"
Abstract: This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE) - extractive and counterfactual - using state-of-the-art LLMs (1B to 70B parameters) on three different classification tasks (both objective and subjective). In line with the literature, our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. are plausible), they do not fully and accurately follow the model's decision process (i.e. are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM's prediction. Our results suggest that extractive SE provide the LLM's "guess" at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM's reasoning: We show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature - followed by a validity check - and introducing a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations that offer a 16 times faster alternative to SHAP on average.
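For illustration only, below is a minimal sketch of the sample-and-check idea described in the abstract. The generate_counterfactual and classify callables are hypothetical placeholders, not functions from this repository, and the closing comment is only an informal reading of the "number of tries" estimate, not the paper's exact formula.

    from typing import Callable, Optional

    def sample_valid_counterfactual(
        text: str,
        original_label: str,
        generate_counterfactual: Callable[[str, float], str],  # (text, temperature) -> candidate edit
        classify: Callable[[str], str],                         # text -> predicted label
        temperature: float = 1.0,
        max_tries: int = 10,
    ) -> Optional[str]:
        """Draw counterfactual candidates at high temperature until one is valid,
        i.e. until the model's prediction actually changes."""
        for _ in range(max_tries):
            candidate = generate_counterfactual(text, temperature)
            if classify(candidate) != original_label:  # validity check
                return candidate
        return None  # no valid counterfactual found within the budget

    # If a single candidate flips the prediction with probability p and candidates are
    # drawn independently, the number of tries until the first valid one is geometrically
    # distributed, so roughly 1/p tries are needed on average.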
After cloning or downloading this repository, first run the Linux shell script ./setup.sh. It will initialize the workspace by performing the following steps:
- It will install the required Python modules by running pip install -r "./requirements.txt".
- It will download the necessary Python code to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
- It will download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
- It will download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
- It will download and preprocess the Toxic Spans Dataset by Pavlopoulos et al. (2022) to "./data/toxic spans/".
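As an optional sanity check after setup (not part of setup.sh), the following minimal Python snippet verifies that the resources and data directories listed above were created:

    from pathlib import Path

    # Resources and datasets created by ./setup.sh, as listed above.
    expected = [
        Path("./resources/bart_score.py"),
        Path("./data/food incidents - hazard"),
        Path("./data/movies"),
        Path("./data/toxic spans"),
    ]

    missing = [p for p in expected if not p.exists()]
    if missing:
        print("Setup incomplete; missing:", ", ".join(str(p) for p in missing))
    else:
        print("Workspace initialized: all expected resources are present.")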
When preprocessing is finished, the experiments can be rerun using the shell script ./run.sh, which will run each of the following Python files with all tested LLMs in turn:
python ./ablation-probability-hazard.py
python ./ablation-saliency-hazard.py
python ./self-explanations-hazard.py
python ./self-explanations-movies.py
python ./self-explanations-toxic.py
python ./self-counterfactuals-hazard.py
python ./self-counterfactuals-movies.py
python ./self-counterfactuals-toxic.py
Originally, the experiments were performed using Python 3.10.12 on 8 NVIDIA RTX A5500 graphics cards with 24GB of memory each.
Finally, the Jupyter notebooks evaluate-hazard.ipynb, evaluate-movies.ipynb, etc. can be used to analyze the results.