LLMs often benefit from verbalized reasoning, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we formalize a framework using deterministic finite automata (DFAs). DFAs offer a formalism through which we can characterize task complexity via measurable properties such as run length (the number of reasoning steps required) and state-space size (decision complexity).
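To make these two measures concrete, below is a minimal, illustrative sketch (a toy parity DFA of our own, not the paper's code): run length is the number of transitions executed on an input, and state-space size is the number of states.

```python
# Illustrative toy DFA: tracks the parity of 1-bits in a binary string.
DFA = {
    "states": {"even", "odd"},              # state-space size = 2
    "start": "even",
    "transitions": {                        # (state, symbol) -> next state
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
}

def run(dfa, input_string):
    """Execute the DFA, returning (final_state, run_length)."""
    state = dfa["start"]
    steps = 0
    for symbol in input_string:
        state = dfa["transitions"][(state, symbol)]
        steps += 1                          # one latent state update per symbol
    return state, steps

final_state, run_length = run(DFA, "101101")
print(final_state, run_length, len(DFA["states"]))   # even 6 2
```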
We find the following:
- Across tasks and across models of varying sizes and training paradigms, there exists an optimal number of reasoning tokens at which the probability of producing a correct solution is maximized.
- We investigate which properties of complexity govern this critical length: task instances whose underlying DFA runs are longer (i.e., that demand more latent state tracking) correlate with longer optimal reasoning lengths, but, surprisingly, DFA size (i.e., state-space complexity) does not.
- We demonstrate an implication of these findings: predicting the optimal number of reasoning tokens for new problems and filtering out answers of non-optimal length yields consistent accuracy improvements (see the sketch after this list).
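The sketch below illustrates the filtering idea only and is not the released pipeline: the length predictor, tolerance, and token counts are placeholder assumptions. It keeps sampled answers whose reasoning length falls near a predicted optimal length, then majority-votes over the survivors.

```python
# Illustrative sketch of length-based filtering; not the paper's implementation.
from collections import Counter

def filter_by_length(samples, predicted_optimal, tolerance=0.25):
    """samples: list of (answer, num_reasoning_tokens) pairs.
    Keep samples whose reasoning length is within +/- tolerance of the
    predicted optimal length; fall back to all samples if none survive."""
    lo = predicted_optimal * (1 - tolerance)
    hi = predicted_optimal * (1 + tolerance)
    kept = [ans for ans, n_tokens in samples if lo <= n_tokens <= hi]
    return kept or [ans for ans, _ in samples]

def vote(answers):
    """Return the most common answer among the kept samples."""
    return Counter(answers).most_common(1)[0][0]

samples = [("42", 310), ("41", 95), ("42", 280), ("17", 900)]
print(vote(filter_by_length(samples, predicted_optimal=300)))   # -> 42
```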
    @misc{lee2025criticalthinkingkindscomplexity,
      title={Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?},
      author={Celine Lee and Alexander M. Rush and Keyon Vafa},
      year={2025},
      eprint={2504.01935},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.01935},
    }
Create a `.env` file containing `OPENAI_API_KEY` and `TOGETHER_API_KEY`.
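For example (the variable names come from the instruction above; the values are placeholders):

```
OPENAI_API_KEY=sk-...
TOGETHER_API_KEY=...
```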
- `. run_vllm.sh`: run experiments with locally served models via vLLM
- `. run_together.sh`: run experiments via the Together API
- `. run_openai.sh`: run experiments via the OpenAI API
- `. extrapolate.sh`: extrapolate optimal reasoning lengths to new problems
- `. plot_all.sh`: generate all plots