self-explaining_llms

Official implementation of the papers "Evaluating the Reliability of Self-Explanations in Large Language Models" and "Mind the Gap: From Plausible to Valid Self-Explanations in Large Language Models".

Abstract (from "Mind the Gap"): This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE), extractive and counterfactual, using state-of-the-art LLMs (1B to 70B parameters) on three different classification tasks (both objective and subjective). In line with the literature, our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. they are plausible), they do not fully and accurately follow the model's decision process (i.e. they are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM's prediction. Our results suggest that extractive SE provide the LLM's "guess" at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM's reasoning: we show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature, followed by a validity check, and we introduce a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations that offer an alternative to SHAP that is, on average, 16 times faster.
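
As a rough, illustrative sketch of the sampling scheme described in the abstract (not the code used in this repository), the snippet below draws counterfactual candidates at a high temperature and keeps the first one that actually flips the model's prediction; classify and propose_counterfactual are hypothetical stand-ins for the underlying LLM calls.

    # Illustrative sketch only; not the repository's implementation.
    # `classify` and `propose_counterfactual` stand in for an LLM classification
    # call and a counterfactual-generation prompt, respectively.
    def sample_valid_counterfactual(text, predicted_label, classify,
                                    propose_counterfactual,
                                    temperature=1.0, max_tries=10):
        """Return the first candidate that changes the predicted label, or None."""
        for _ in range(max_tries):
            # A high sampling temperature increases the diversity of candidates.
            candidate = propose_counterfactual(text, predicted_label,
                                               temperature=temperature)
            # Validity check: keep the candidate only if it flips the prediction.
            if classify(candidate) != predicted_label:
                return candidate
        return None

    # If a single candidate is valid with probability p, the number of draws is
    # geometrically distributed, so on average roughly 1 / p tries are needed
    # (an illustration of the kind of estimate derived in the paper, not
    # necessarily its exact formula).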

Usage

After cloning or downloading this repository, first run the Linux shell script ./setup.sh. It initializes the workspace by performing the following steps (an optional check of the resulting layout is sketched after the list):

  1. Install the required Python modules by running pip install -r "./requirements.txt".
  2. Download the Python code needed to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
  3. Download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
  4. Download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
  5. Download and preprocess the Toxic Spans Dataset by Pavlopoulos et al. (2022) to "./data/toxic spans/".
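
For reference, the optional snippet below (not part of the repository) checks that the files and folders listed above exist once ./setup.sh has finished:

    # Optional post-setup check; not part of the repository. Verifies that
    # ./setup.sh created the resources and datasets listed above.
    from pathlib import Path

    expected = [
        "./resources/bart_score.py",
        "./data/food incidents - hazard/",
        "./data/movies/",
        "./data/toxic spans/",
    ]

    missing = [path for path in expected if not Path(path).exists()]
    if missing:
        print("Missing after setup:", ", ".join(missing))
    else:
        print("Workspace initialized: all expected paths are present.")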

When preprocessing has finished, the experiments can be rerun with the shell script ./run.sh, which runs each of the following Python scripts with every tested LLM in turn:

  • python ./ablation-probability-hazard.py
  • python ./ablation-saliency-hazard.py
  • python ./self-explanations-hazard.py
  • python ./self-explanations-movies.py
  • python ./self-explanations-toxic.py
  • python ./self-counterfactuals-hazard.py
  • python ./self-counterfactuals-movies.py
  • python ./self-counterfactuals-toxic.py

The experiments were originally run with Python 3.10.12 on eight NVIDIA RTX A5500 GPUs with 24 GB of memory each. Once the scripts have finished, the Jupyter notebooks evaluate-hazard.ipynb, evaluate-movies.ipynb, etc. can be used to analyze the results.

Sources

Yuan, W., Neubig, G., & Liu, P. (2021). BARTScore: Evaluating Generated Text as Text Generation. arXiv.

Randl, K., Karvounis, M., Marinos, G., Pavlopoulos, J., Lindgren, T., & Henriksson, A. (2024). Food Recall Incidents [Data set]. Zenodo.

DeYoung, J., Jain, S., Rajani, N. F., Lehman, E., Xiong, C., Socher, R., & Wallace, B. C. (2020). ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458, Online. Association for Computational Linguistics.

Zaidan, O., & Eisner, J. (2008). Modeling Annotators: A Generative Approach to Learning from Annotator Rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 31–40, Honolulu, Hawaii. Association for Computational Linguistics.

Pavlopoulos, J., Laugier, L., Xenos, A., Sorensen, J., & Androutsopoulos, I. (2022). From the Detection of Toxic Spans in Online Discussions to the Analysis of Toxic-to-Civil Transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3721–3734, Dublin, Ireland. Association for Computational Linguistics.
