[Tweet (brief overview of the paper)]
🙁 LLMs are overconfident even when they are dead wrong.
🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?
❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.
```bash
# clone the repository
pip install -e lm-eval-harness
pip install -e evalchemy
pip install vllm
```
- For reasoning models that reliably generate "Confidence Reasoning": `bash evalchemy/scripts/reasoning_no_force.sh`
- For reasoning models that do not reliably generate "Confidence Reasoning" (R1 Distill, OR1-Preview, GLM Z1): `bash evalchemy/scripts/reasoning_force.sh`
- For non-reasoning models: `bash evalchemy/scripts/non_reasoning.sh`
- Finally, use the notebook `results/calculate_metrics.ipynb` to calculate ECE, Brier Score, and AUROC for the outputs.
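For reference, the three metrics can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the notebook's exact code; the binning scheme (ten equal-width bins) and input format (confidences in [0, 1] paired with 0/1 correctness labels) are assumptions.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clip c == 1.0 into the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

def brier(confidences, correct):
    """Mean squared error between stated confidence and correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

def auroc(confidences, correct):
    """Probability that a correct answer is ranked above an incorrect one
    (ties count half), computed by brute force over all pairs."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Lower is better for ECE and Brier Score (calibration), higher is better for AUROC (discriminating correct from incorrect answers).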
- For reasoning models: `bash evalchemy/reasoning_slope.sh`
- For non-reasoning models: `bash evalchemy/non_reasoning_slope.sh`
- Finally, use the notebook `results/linear_regression.ipynb` to run linear regression on the calibration metrics.
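The core of the slope analysis is an ordinary least-squares fit. The sketch below is illustrative only (the data and variable names are made up); `results/linear_regression.ipynb` runs this kind of regression over the segmented-CoT datasets listed below, e.g. fitting per-segment confidence against normalized position in the chain of thought.

```python
def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical example: confidence rising over the course of a CoT.
positions = [0.0, 0.25, 0.5, 0.75, 1.0]   # normalized segment position
confidence = [0.4, 0.5, 0.55, 0.7, 0.8]   # per-segment confidence estimate
slope = ols_slope(positions, confidence)  # positive slope: confidence increases
```

A positive slope indicates that the model's expressed confidence changes systematically as its reasoning unfolds, which is what the slope scripts above measure.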
Change the dataset path and the model name appropriately, referring to the list below.
Paths to the segmented CoTs
Reasoning Models
- DKYoon/qwen3-think-nonambigqa-slope
- DKYoon/qwen3-think-triviaqa-slope
- DKYoon/r1-nonambigqa-slope
- DKYoon/r1-triviaqa-slope
- DKYoon/exaone-deep-nonambigqa-slope
- DKYoon/exaone-deep-triviaqa-slope
- DKYoon/glm-z1-nonambigqa-slope
- DKYoon/glm-z1-triviaqa-slope
Non-Reasoning Models
- DKYoon/qwen3-non-think-nonambigqa-slope
- DKYoon/qwen3-non-think-triviaqa-slope
- DKYoon/glm-instruct-nonambigqa-slope
- DKYoon/glm-instruct-triviaqa-slope
- DKYoon/exaone-instruct-nonambigqa-slope
- DKYoon/exaone-instruct-triviaqa-slope
- DKYoon/qwen25-nonambigqa-slope
- DKYoon/qwen25-triviaqa-slope
```bash
bash evalchemy/reasoning_ablations.sh
```
The code used to create the ablated CoTs is available in `ablation_data/`.
```bash
bash evalchemy/non_reasoning_slow_think.sh
```
The few-shot slow thinking examples are available in `evalchemy/eval/chat_benchmarks/non_reasoning_slow_think/few_shot_prompt.py`.
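To give a sense of the general shape of few-shot slow-thinking prompting, here is a purely hypothetical sketch: exemplars of step-by-step reasoning with a verbalized confidence are prepended to the target question. The exemplar text and format below are invented for illustration; the actual prompts used in the paper are the ones in `few_shot_prompt.py`.

```python
# Hypothetical exemplars; the real ones live in few_shot_prompt.py.
FEW_SHOT = [
    (
        "What is the capital of Australia?",
        "Let me think step by step. Sydney is the largest city, but the "
        "capital was purpose-built as a compromise between Sydney and "
        "Melbourne: Canberra. Answer: Canberra. Confidence: 0.95",
    ),
]

def build_prompt(question: str) -> str:
    """Concatenate slow-thinking exemplars before the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in FEW_SHOT]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The idea is to elicit reasoning-model-like behavior (deliberation before answering, then a verbalized confidence) from non-reasoning models via in-context examples.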
```bibtex
@inproceedings{
  yoon2025reasoning,
  title={Reasoning Models Better Express Their Confidence},
  author={Dongkeun Yoon and Seungone Kim and Sohee Yang and Sunkyoung Kim and Soyeon Kim and Yongil Kim and Eunbi Choi and Yireun Kim and Minjoon Seo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=rbBtoVnduo}
}
```
