MattYoon/reasoning-models-confidence

[NeurIPS 2025] Reasoning Models Better Express Their Confidence

[paper]

[tweet (brief overview of the paper)]

Summary

🙁 LLMs are overconfident even when they are dead wrong.

🧐 What about reasoning models? Can they actually tell us “My answer is only 60% likely to be correct”?

❗Our paper suggests that they can! Through extensive analysis, we investigate what enables this emergent ability.

[Figure 1]

Installation

# clone the repository
git clone https://github.com/MattYoon/reasoning-models-confidence.git
cd reasoning-models-confidence

# install the bundled evaluation frameworks
pip install -e lm-eval-harness
pip install -e evalchemy
pip install vllm

Section 3: Main Experiment

  1. For reasoning models that reliably generate "Confidence Reasoning":
bash evalchemy/scripts/reasoning_no_force.sh
  2. For reasoning models that do not reliably generate "Confidence Reasoning" (R1 Distill, OR1-Preview, GLM Z1):
bash evalchemy/scripts/reasoning_force.sh
  3. For non-reasoning models:
bash evalchemy/scripts/non_reasoning.sh
  4. Finally, use the notebook results/calculate_metrics.ipynb to calculate ECE, Brier Score, and AUROC for the outputs.
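The notebook results/calculate_metrics.ipynb computes the actual numbers; for reference, here is a minimal standalone sketch of the three metrics' standard definitions (the function names and signatures are illustrative, not taken from the repo):

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared error between stated confidence and 0/1 correctness."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # first bin is closed on the left so conf == 0.0 is counted
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def auroc(conf, correct):
    """Probability a correct answer receives higher confidence than an incorrect one."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

Here ECE measures the gap between confidence and accuracy per bin, the Brier score penalizes squared miscalibration per sample, and AUROC measures how well confidence ranks correct answers above incorrect ones.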

Section 4: Analysis

Section 4.1: Linear Regression

  1. For reasoning models:
bash evalchemy/reasoning_slope.sh
  2. For non-reasoning models:
bash evalchemy/non_reasoning_slope.sh
  3. Finally, use the notebook results/linear_regression.ipynb to run linear regression on the calibration metrics.
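The notebook results/linear_regression.ipynb runs the actual regressions; as a rough illustration of the idea, fitting a linear trend to a calibration metric measured across CoT segments can be done with an ordinary least-squares line fit (function and argument names below are illustrative, not from the repo):

```python
import numpy as np

def metric_slope(segment_positions, metric_values):
    """OLS slope and intercept of a calibration metric over CoT segment position."""
    x = np.asarray(segment_positions, float)
    y = np.asarray(metric_values, float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial = line fit
    return float(slope), float(intercept)
```

A positive slope on, e.g., AUROC over segment position would indicate that confidence quality improves as the chain of thought unfolds.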

Change the dataset path and the model name as appropriate, referring to the list below.

Paths to the segmented CoTs

Reasoning Models

  • DKYoon/qwen3-think-nonambigqa-slope
  • DKYoon/qwen3-think-triviaqa-slope
  • DKYoon/r1-nonambigqa-slope
  • DKYoon/r1-triviaqa-slope
  • DKYoon/exaone-deep-nonambigqa-slope
  • DKYoon/exaone-deep-triviaqa-slope
  • DKYoon/glm-z1-nonambigqa-slope
  • DKYoon/glm-z1-triviaqa-slope

Non-Reasoning Models

  • DKYoon/qwen3-non-think-nonambigqa-slope
  • DKYoon/qwen3-non-think-triviaqa-slope
  • DKYoon/glm-instruct-nonambigqa-slope
  • DKYoon/glm-instruct-triviaqa-slope
  • DKYoon/exaone-instruct-nonambigqa-slope
  • DKYoon/exaone-instruct-triviaqa-slope
  • DKYoon/qwen25-nonambigqa-slope
  • DKYoon/qwen25-triviaqa-slope
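All of the identifiers above follow a DKYoon/&lt;model&gt;-&lt;dataset&gt;-slope pattern; a trivial helper (purely illustrative, not part of the repo) can build them when iterating over model/dataset pairs:

```python
def slope_dataset_path(model: str, dataset: str) -> str:
    """Build a segmented-CoT dataset identifier from a model and dataset name."""
    return f"DKYoon/{model}-{dataset}-slope"
```

For example, slope_dataset_path("qwen3-think", "triviaqa") yields "DKYoon/qwen3-think-triviaqa-slope".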

Section 4.2: Ablation

bash evalchemy/reasoning_ablations.sh

The code used to create the ablated CoTs is available in ablation_data/.

Section 4.3: In-context Slow Thinking

bash evalchemy/non_reasoning_slow_think.sh

The few-shot slow thinking examples are available in evalchemy/eval/chat_benchmarks/non_reasoning_slow_think/few_shot_prompt.py.


Citation

@inproceedings{yoon2025reasoning,
  title={Reasoning Models Better Express Their Confidence},
  author={Dongkeun Yoon and Seungone Kim and Sohee Yang and Sunkyoung Kim and Soyeon Kim and Yongil Kim and Eunbi Choi and Yireun Kim and Minjoon Seo},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=rbBtoVnduo}
}
