MattYoon/reasoning-models-confidence

[NeurIPS 2025] Reasoning Models Better Express Their Confidence

[paper]

[tweet (brief overview of the paper)]

Summary

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.

[Figure 1]

Installation

# clone the repository
git clone https://github.com/MattYoon/reasoning-models-confidence.git
cd reasoning-models-confidence

# install the bundled evaluation frameworks
pip install -e lm-eval-harness
pip install -e evalchemy
pip install vllm

Section 3: Main Experiment

  1. For reasoning models that reliably generate "Confidence Reasoning":
bash evalchemy/scripts/reasoning_no_force.sh
  2. For reasoning models that do not reliably generate "Confidence Reasoning" (R1 Distill, OR1-Preview, GLM Z1):
bash evalchemy/scripts/reasoning_force.sh
  3. For non-reasoning models:
bash evalchemy/scripts/non_reasoning.sh
  4. Finally, use the notebook results/calculate_metrics.ipynb to calculate ECE, Brier Score, and AUROC for the outputs.
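The notebook results/calculate_metrics.ipynb computes the actual numbers; for reference, here is a minimal standalone sketch of the three metrics' standard definitions (the function names and signatures are illustrative, not taken from the repo):

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared error between stated confidence and 0/1 correctness."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # first bin is closed on the left so conf == 0.0 is counted
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def auroc(conf, correct):
    """Probability a correct answer receives higher confidence than an incorrect one."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

Here ECE measures the gap between confidence and accuracy per bin, the Brier score penalizes squared miscalibration per sample, and AUROC measures how well confidence ranks correct answers above incorrect ones.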

Section 4: Analysis

Section 4.1: Linear Regression

  1. For reasoning models:
bash evalchemy/reasoning_slope.sh
  2. For non-reasoning models:
bash evalchemy/non_reasoning_slope.sh
  3. Finally, use the notebook results/linear_regression.ipynb to run linear regression on the calibration metrics.
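The notebook results/linear_regression.ipynb runs the actual regressions; as a rough illustration of the idea, fitting a linear trend to a calibration metric measured across CoT segments can be done with an ordinary least-squares line fit (function and argument names below are illustrative, not from the repo):

```python
import numpy as np

def metric_slope(segment_positions, metric_values):
    """OLS slope and intercept of a calibration metric over CoT segment position."""
    x = np.asarray(segment_positions, float)
    y = np.asarray(metric_values, float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial = line fit
    return float(slope), float(intercept)
```

A positive slope on, e.g., AUROC over segment position would indicate that confidence quality improves as the chain of thought unfolds.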

Change the dataset path and the model name as appropriate, referring to the list below.

Paths to the segmented CoTs

Reasoning Models

  • DKYoon/qwen3-think-nonambigqa-slope
  • DKYoon/qwen3-think-triviaqa-slope
  • DKYoon/r1-nonambigqa-slope
  • DKYoon/r1-triviaqa-slope
  • DKYoon/exaone-deep-nonambigqa-slope
  • DKYoon/exaone-deep-triviaqa-slope
  • DKYoon/glm-z1-nonambigqa-slope
  • DKYoon/glm-z1-triviaqa-slope

Non-Reasoning Models

  • DKYoon/qwen3-non-think-nonambigqa-slope
  • DKYoon/qwen3-non-think-triviaqa-slope
  • DKYoon/glm-instruct-nonambigqa-slope
  • DKYoon/glm-instruct-triviaqa-slope
  • DKYoon/exaone-instruct-nonambigqa-slope
  • DKYoon/exaone-instruct-triviaqa-slope
  • DKYoon/qwen25-nonambigqa-slope
  • DKYoon/qwen25-triviaqa-slope
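All of the identifiers above follow a DKYoon/&lt;model&gt;-&lt;dataset&gt;-slope pattern; a trivial helper (purely illustrative, not part of the repo) can build them when iterating over model/dataset pairs:

```python
def slope_dataset_path(model: str, dataset: str) -> str:
    """Build a segmented-CoT dataset identifier from a model and dataset name."""
    return f"DKYoon/{model}-{dataset}-slope"
```

For example, slope_dataset_path("qwen3-think", "triviaqa") yields "DKYoon/qwen3-think-triviaqa-slope".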

Section 4.2: Ablation

bash evalchemy/reasoning_ablations.sh

The code used to create the ablated CoTs is available in ablation_data/.

Section 4.3: In-context Slow Thinking

bash evalchemy/non_reasoning_slow_think.sh

The few-shot slow thinking examples are available in evalchemy/eval/chat_benchmarks/non_reasoning_slow_think/few_shot_prompt.py.


Citation

@inproceedings{yoon2025reasoning,
  title={Reasoning Models Better Express Their Confidence},
  author={Dongkeun Yoon and Seungone Kim and Sohee Yang and Sunkyoung Kim and Soyeon Kim and Yongil Kim and Eunbi Choi and Yireun Kim and Minjoon Seo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=rbBtoVnduo}
}
