LLMs often benefit from verbalized reasoning, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we formalize a framework using deterministic finite automata (DFAs). DFAs offer a formalism through which we can characterize task complexity via measurable properties such as run length (the number of reasoning steps required) and state-space size (decision complexity).
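To make these two measures concrete, below is a minimal, illustrative sketch (a toy parity DFA of our own, not the paper's code): run length is the number of transitions executed on an input, and state-space size is the number of states.

```python
# Illustrative toy DFA: tracks the parity of 1-bits in a binary string.
DFA = {
    "states": {"even", "odd"},              # state-space size = 2
    "start": "even",
    "transitions": {                        # (state, symbol) -> next state
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
}

def run(dfa, input_string):
    """Execute the DFA, returning (final_state, run_length)."""
    state = dfa["start"]
    steps = 0
    for symbol in input_string:
        state = dfa["transitions"][(state, symbol)]
        steps += 1                          # one latent state update per symbol
    return state, steps

final_state, run_length = run(DFA, "101101")
print(final_state, run_length, len(DFA["states"]))   # even 6 2
```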
We find the following:
- Across tasks and across models of varying sizes and training paradigms, there exists an optimal number of reasoning tokens at which the probability of producing a correct solution is maximized.
- We investigate which properties of complexity govern this critical length: task instances whose underlying DFA runs are longer (i.e., that demand more latent state tracking) correlate with longer optimal reasoning lengths, but, surprisingly, DFA size (i.e., state-space complexity) does not.
- We demonstrate an implication of these findings: predicting the optimal number of reasoning tokens for new problems and filtering out answers of non-optimal length yields consistent accuracy improvements (see the sketch after this list).
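The sketch below illustrates the filtering idea only and is not the released pipeline: the length predictor, tolerance, and token counts are placeholder assumptions. It keeps sampled answers whose reasoning length falls near a predicted optimal length, then majority-votes over the survivors.

```python
# Illustrative sketch of length-based filtering; not the paper's implementation.
from collections import Counter

def filter_by_length(samples, predicted_optimal, tolerance=0.25):
    """samples: list of (answer, num_reasoning_tokens) pairs.
    Keep samples whose reasoning length is within +/- tolerance of the
    predicted optimal length; fall back to all samples if none survive."""
    lo = predicted_optimal * (1 - tolerance)
    hi = predicted_optimal * (1 + tolerance)
    kept = [ans for ans, n_tokens in samples if lo <= n_tokens <= hi]
    return kept or [ans for ans, _ in samples]

def vote(answers):
    """Return the most common answer among the kept samples."""
    return Counter(answers).most_common(1)[0][0]

samples = [("42", 310), ("41", 95), ("42", 280), ("17", 900)]
print(vote(filter_by_length(samples, predicted_optimal=300)))   # -> 42
```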
    @misc{lee2025criticalthinkingkindscomplexity,
      title={Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?},
      author={Celine Lee and Alexander M. Rush and Keyon Vafa},
      year={2025},
      eprint={2504.01935},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.01935},
    }
Create a `.env` file containing `OPENAI_API_KEY` and `TOGETHER_API_KEY`.
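For example (the variable names come from the instruction above; the values are placeholders):

```
OPENAI_API_KEY=sk-...
TOGETHER_API_KEY=...
```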
- `. run_vllm.sh`: run experiments with locally served models via vLLM
- `. run_together.sh`: run experiments via the Together API
- `. run_openai.sh`: run experiments via the OpenAI API
- `. extrapolate.sh`: extrapolate optimal reasoning lengths to new problems
- `. plot_all.sh`: generate all plots