JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

JustLogic is a deductive reasoning dataset that is

  1. highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures;
  2. prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and
  3. capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy.

This repository contains all code and prompts to reproduce the dataset, evaluations, and statistics in the paper.

Dataset Construction (Section 3)

python create_dataset/template.py -maxd 7 -mind 1 -dsample 1000 
  • maxd: Maximum argument depth
  • mind: Minimum argument depth
  • dsample: Number of samples per depth

To create the train-test split, run create_dataset/create_split.py.

The full dataset can be found in the dataset folder. To prevent benchmark leakage, the test set is not openly released; however, it can be regenerated by running the scripts above (see the example below) and is also available from the authors upon request.
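
For example, the full pipeline below regenerates the question set and splits it; create_split.py's arguments, if any, are documented in the script itself.

python create_dataset/template.py -maxd 7 -mind 1 -dsample 1000
python create_dataset/create_split.py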

Dataset Format

  • premises: The question's premises, as a Python list.
  • paragraph: A paragraph composed of the above premises. This is given as input to models.
  • conclusion: The expected conclusion of the given premises.
  • question: The statement whose truth value models must determine.
  • label: True | False | Uncertain
  • arg: The argument structure.
  • statements: A mapping from symbols in arg to their corresponding natural-language statements.
  • depth: The argument depth of the given question.
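
As a minimal sketch, assuming the dataset is stored as JSON Lines in the dataset folder (the file name below is hypothetical), a record can be read and its fields accessed as follows. The prompt string here is illustrative only; the prompts used in the paper are included in the repository.

import json

# Hypothetical path; the actual file name inside the dataset folder may differ.
with open("dataset/train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        gold = example["label"]          # "True" | "False" | "Uncertain"
        depth = example["depth"]         # argument depth of this question
        structure = example["arg"]       # symbolic argument structure
        mapping = example["statements"]  # symbols in arg -> natural-language statements
        # 'paragraph' is the model input; 'question' is the statement to judge.
        prompt = (
            f"{example['paragraph']}\n\n"
            "Is the following statement true, false, or uncertain? "
            f"{example['question']}"
        )
        print(prompt, "->", gold, f"(depth {depth})")
        break  # inspect a single example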

Reproducing Experiments

Prior-Knowledge Independence Test (Section 5.1)

  • Run pki_eval/get_completions.py to get model predictions. OpenAI and OpenRouter models are supported.
  • Run pki_eval/eval.py to find the accuracy rate.
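
A typical run is sketched below. The assumption that API keys are read from the OPENAI_API_KEY and OPENROUTER_API_KEY environment variables is ours; check the script for its exact configuration and model-selection arguments.

export OPENAI_API_KEY=...            # or OPENROUTER_API_KEY for OpenRouter models (assumed)
python pki_eval/get_completions.py   # collects model predictions
python pki_eval/eval.py              # reports the accuracy rate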

LLM Evaluation (Section 5.2)

  • To reproduce results, run eval/get_completions.py. OpenAI and OpenRouter models are supported.
  • Run eval/eval.py to find the accuracy rate.
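
The workflow mirrors the prior-knowledge independence test above; model-selection arguments, if any, are documented in the scripts themselves.

python eval/get_completions.py   # collect completions (OpenAI or OpenRouter)
python eval/eval.py              # compute the accuracy rate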

Statistics

  • The complexity statistics in Section 3.4 can be reproduced by running statistics/complexity_stats.py.
  • The error-analysis graphs in Section 5.3 can be reproduced by running statistics/error_analysis.py.
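
Both are standalone scripts (an assumption based on the descriptions above) and can be run directly:

python statistics/complexity_stats.py   # complexity statistics (Section 3.4)
python statistics/error_analysis.py     # error-analysis graphs (Section 5.3)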
