JustLogic is a deductive reasoning dataset that is
- highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures;
- prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and
- capable of supporting in-depth error analysis of the heterogeneous effects of reasoning depth and argument form on model accuracy.
This repository contains all code and prompts to reproduce the dataset, evaluations, and statistics in the paper.
To generate the dataset, run:

```bash
python create_dataset/template.py -maxd 7 -mind 1 -dsample 1000
```
- `maxd`: Maximum argument depth
- `mind`: Minimum argument depth
- `dsample`: No. of samples per depth
To create a train-test split, use `create_dataset/create_split.py`.
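For illustration only, a depth-stratified split could look like the following minimal sketch; the file names and the 80/20 ratio here are assumptions, not the actual interface of `create_dataset/create_split.py`:

```python
import json
import random

# Illustrative sketch only: the real split logic lives in
# create_dataset/create_split.py. File names and ratio are assumptions.
random.seed(0)

with open("dataset/justlogic.json") as f:  # assumed file name
    samples = json.load(f)

# Stratify by argument depth so both splits cover all depths.
by_depth = {}
for sample in samples:
    by_depth.setdefault(sample["depth"], []).append(sample)

train, test = [], []
for depth, group in by_depth.items():
    random.shuffle(group)
    cut = int(0.8 * len(group))  # assumed 80/20 split
    train.extend(group[:cut])
    test.extend(group[cut:])

with open("dataset/train.json", "w") as f:
    json.dump(train, f)
with open("dataset/test.json", "w") as f:
    json.dump(test, f)
```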
The full dataset can be found in the `dataset` folder. To prevent benchmark leakage, the test set is not openly released. However, it can be easily generated by running the above scripts. The test set is also available upon request to the authors.
Each sample contains the following fields (illustrated in the sketch below):
- `premises`: List of premises in the question, in the form of a Python list.
- `paragraph`: A paragraph consisting of the above `premises`. This is given as input to models.
- `conclusion`: The expected conclusion of the given premises.
- `question`: The statement whose truth value models must determine.
- `label`: `True` | `False` | `Uncertain`
- `arg`: The argument structure.
- `statements`: Mapping of symbols in `arg` to their corresponding natural language statements.
- `depth`: The argument depth of the given question.
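As a quick sketch of working with these fields (the file name is an assumption; the field names follow the schema above):

```python
import json

# File name is an assumption; field names follow the dataset schema above.
with open("dataset/train.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["paragraph"])   # natural-language premises given to the model
print(sample["question"])    # statement whose truth value must be determined
print(sample["label"])       # True | False | Uncertain
print(sample["depth"])       # argument depth of this question
```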
- Run `pki_eval/get_completions.py` to get model predictions. OpenAI and OpenRouter models are supported. (A rough sketch of this flow follows the list.)
- Run `pki_eval/eval.py` to find the accuracy rate.
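The exact prompts and command-line options live in the scripts themselves. As a rough, hypothetical sketch of what the prediction step involves (model name, prompt wording, and file paths are all assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("dataset/test.json") as f:  # assumed path
    samples = json.load(f)

completions = []
for sample in samples:
    # Assumed prompt wording; the real prompts are in pki_eval/get_completions.py.
    prompt = (
        f"{sample['paragraph']}\n\n"
        f"Is the following statement true, false, or uncertain? "
        f"{sample['question']}\n"
        f"Answer with one word: True, False, or Uncertain."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any supported OpenAI/OpenRouter model
        messages=[{"role": "user", "content": prompt}],
    )
    completions.append({
        "label": sample["label"],
        "prediction": response.choices[0].message.content,
    })

with open("completions.json", "w") as f:
    json.dump(completions, f)
```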
- To reproduce results, run `eval/get_completions.py`. OpenAI and OpenRouter models are supported.
- Run `eval/eval.py` to find the accuracy rate. (A minimal scoring sketch follows the list.)
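Scoring then reduces to comparing model verdicts against gold labels. A minimal sketch of that computation, assuming the prediction file format from the sketch above (the real logic is in `eval/eval.py`):

```python
import json

# Assumed format: a list of {"label": ..., "prediction": ...} records,
# matching the earlier sketch. The real logic is in eval/eval.py.
with open("completions.json") as f:
    completions = json.load(f)

def normalize(answer: str) -> str:
    # Take the first recognized verdict word in the model's reply.
    for verdict in ("True", "False", "Uncertain"):
        if verdict.lower() in answer.lower():
            return verdict
    return "Invalid"

correct = sum(normalize(c["prediction"]) == c["label"] for c in completions)
print(f"Accuracy: {correct / len(completions):.2%}")
```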
- The statistics on complexity in Section 3.4 can be reproduced with `statistics/complexity_stats.py`.
- The graphs for the error analysis in Section 5.3 can be reproduced with `statistics/error_analysis.py`. (A per-depth accuracy sketch follows the list.)
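As a hypothetical example, the per-depth breakdown behind such graphs amounts to a simple grouping; the file layout and the loose answer matching below are assumptions:

```python
import json
from collections import defaultdict

# Assumed inputs: test samples with a "depth" field, and a completions file
# (as in the earlier sketches) aligned with them in order.
with open("dataset/test.json") as f:
    samples = json.load(f)
with open("completions.json") as f:
    completions = json.load(f)

totals, hits = defaultdict(int), defaultdict(int)
for sample, completion in zip(samples, completions):
    totals[sample["depth"]] += 1
    # Loose string match; the real script may parse answers more carefully.
    hits[sample["depth"]] += sample["label"].lower() in completion["prediction"].lower()

for depth in sorted(totals):
    print(f"depth {depth}: {hits[depth] / totals[depth]:.2%} ({totals[depth]} samples)")
```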