- Beyond Leaderboards: A survey of methods for revealing weaknesses in Natural Language Inference data and models - arXiv 2020. (Updated version on Cambridge.org)
- Shortcut learning in deep neural networks - Nature Machine Intelligence 2020.
- Measure and Improve Robustness in NLP Models: A Survey - NAACL 2022.
- Shortcut Learning of Large Language Models in Natural Language Understanding - Communications of the ACM (CACM) 2023.

| Paper | Add/Edit Level | Creation Method | Original Dataset | Naturalness |
|---|---|---|---|---|
| Consistency Ribeiro et al. 2019 | question | auto | SQuAD | Yes |
| Contrast Sets Gardner et al. 2020 | word | experts | DROP, Quoref, ... | No |
| SAM Schlegel et al. 2021 | word | auto | SQuAD, HotpotQA, DROP, NewsQA | No |
| Break, Perturb, Build Geva et al. 2022 | question | auto | DROP, HotpotQA, IIRC | Yes |
| **Unanswerable Questions** | | | | |
| Not-answerable Questions Nakanishi et al. 2018 | context | auto | SQuAD | Yes |
| SQuADRUn Rajpurkar et al. 2018 | question | crowdworkers | SQuAD | Yes |
| Disconnected Reasoning Trivedi et al. 2020 | context | auto | HotpotQA | Yes |
| MuSiQue Trivedi et al. 2022 | context | auto | MuSiQue-Ans | Yes |

| Paper | Form | Purpose | Task | GitHub | Dataset | Note |
|---|---|---|---|---|---|---|
| Inoue et al. 2020 | Triple | Evaluation & Training | Derivation generation | URL | R4C | based on HotpotQA |
| Ho et al. 2020 | Triple | Evaluation & Training | Evidence generation | URL | 2WikiMultiHopQA | |
| Wolfson et al. 2020 | QDMR | Training | - | URL | Break it down | based on ten datasets (e.g., HotpotQA & DROP) |
| Tang et al. 2021 | Sub-question | Evaluation | QA about sub-questions | URL | 1,000 samples | based on HotpotQA |
| Geva et al. 2021 | Sub-question | Evaluation & Training | QA about sub-questions | URL | StrategyQA | implicit questions |
| Ho et al. 2022 | Sub-question | Evaluation & Training | QA about sub-questions | URL | HieraDate | comparison questions about date information only |
| Trivedi et al. 2022 | Sub-question | Evaluation & Training | QA about sub-questions | URL | MuSiQue | |
| Dalvi et al. 2021 | Entailment Tree | Evaluation & Training | Tree generation | URL | EntailmentBank | based on ARC and WorldTree V2 |
| Ribeiro et al. 2023 | Graph | Evaluation & Training | Graph generation | URL | STREET | based on ARC, SCONE, GSM8K, AQUA-RAT, and AR-LSAT |