JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

JustLogic is a deductive reasoning dataset that is

  1. highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures;
  2. prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and
  3. capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy.

This repository contains all code and prompts to reproduce the dataset, evaluations, and statistics in the paper.

Dataset Construction (Section 3)

python create_dataset/template.py -maxd 7 -mind 1 -dsample 1000 
  • maxd: Maximum argument depth
  • mind: Minimum argument depth
  • dsample: Number of samples per depth

To create the train-test split, run create_dataset/create_split.py.

The full dataset can be found in the dataset folder. To prevent benchmark leakage, the test set is not openly released; however, it can be regenerated by running the scripts above (see the example below) and is also available from the authors upon request.
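
For example, the full pipeline below regenerates the question set and splits it; create_split.py's arguments, if any, are documented in the script itself.

python create_dataset/template.py -maxd 7 -mind 1 -dsample 1000
python create_dataset/create_split.py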

Dataset Format

  • premises: The question's premises, as a Python list.
  • paragraph: A paragraph composed of the above premises. This is given as input to models.
  • conclusion: The expected conclusion of the given premises.
  • question: The statement whose truth value models must determine.
  • label: True | False | Uncertain
  • arg: The argument structure.
  • statements: A mapping from symbols in arg to their corresponding natural-language statements.
  • depth: The argument depth of the given question.
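
As a minimal sketch, assuming the dataset is stored as JSON Lines in the dataset folder (the file name below is hypothetical), a record can be read and its fields accessed as follows. The prompt string here is illustrative only; the prompts used in the paper are included in the repository.

import json

# Hypothetical path; the actual file name inside the dataset folder may differ.
with open("dataset/train.jsonl") as f:
    for line in f:
        example = json.loads(line)
        gold = example["label"]          # "True" | "False" | "Uncertain"
        depth = example["depth"]         # argument depth of this question
        structure = example["arg"]       # symbolic argument structure
        mapping = example["statements"]  # symbols in arg -> natural-language statements
        # 'paragraph' is the model input; 'question' is the statement to judge.
        prompt = (
            f"{example['paragraph']}\n\n"
            "Is the following statement true, false, or uncertain? "
            f"{example['question']}"
        )
        print(prompt, "->", gold, f"(depth {depth})")
        break  # inspect a single example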

Reproducing Experiments

Prior-Knowledge Independence Test (Section 5.1)

  • Run pki_eval/get_completions.py to get model predictions. OpenAI and OpenRouter models are supported.
  • Run pki_eval/eval.py to find the accuracy rate.
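
A typical run is sketched below. The assumption that API keys are read from the OPENAI_API_KEY and OPENROUTER_API_KEY environment variables is ours; check the script for its exact configuration and model-selection arguments.

export OPENAI_API_KEY=...            # or OPENROUTER_API_KEY for OpenRouter models (assumed)
python pki_eval/get_completions.py   # collects model predictions
python pki_eval/eval.py              # reports the accuracy rate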

LLM Evaluation (Section 5.2)

  • To reproduce results, run eval/get_completions.py. OpenAI and OpenRouter models are supported.
  • Run eval/eval.py to find the accuracy rate.
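
The workflow mirrors the prior-knowledge independence test above; model-selection arguments, if any, are documented in the scripts themselves.

python eval/get_completions.py   # collect completions (OpenAI or OpenRouter)
python eval/eval.py              # compute the accuracy rate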

Statistics

  • The complexity statistics in Section 3.4 can be reproduced by running statistics/complexity_stats.py.
  • The error-analysis graphs in Section 5.3 can be reproduced by running statistics/error_analysis.py.
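
Both are standalone scripts (an assumption based on the descriptions above) and can be run directly:

python statistics/complexity_stats.py   # complexity statistics (Section 3.4)
python statistics/error_analysis.py     # error-analysis graphs (Section 5.3)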
