Overview and official implementation of the Trajectory Volatility (TV) score for OOD detection in mathematical reasoning.
See our paper for details.
Abbreviations: In-Distribution -> ID; Out-of-Distribution -> OOD
- Input Space: low distinction between different domains
- Output Space: compressed high-density search space -> pattern collapse
- Because trajectory endpoints are constrained in mathematical reasoning, trajectory volatility is more likely to differ across samples.
A trajectory-based algorithm to detect OOD samples in mathematical reasoning scenarios.
Algorithm Pipeline:
We denote 
- Step 1: Mahalanobis Distance Mapping
- Step 2: Average of Absolute Value Difference
TV score w/o Differential Smoothing when 
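The two steps above can be sketched in NumPy as follows. This is a minimal illustration without Differential Smoothing, not the repository's code: the function names, the array layout `(n_layers, dim)`, the ridge term added to each covariance, and the use of the non-squared Mahalanobis distance are all assumptions made for the example.

```python
import numpy as np

def fit_layer_gaussians(id_hidden, ridge=1e-6):
    """Fit one Gaussian per layer on ID hidden states.

    id_hidden: array of shape (n_samples, n_layers, dim) (assumed layout).
    Returns per-layer means (n_layers, dim) and inverse covariances
    (n_layers, dim, dim). The ridge term is an assumption for stability.
    """
    n_samples, n_layers, dim = id_hidden.shape
    means = id_hidden.mean(axis=0)
    inv_covs = np.empty((n_layers, dim, dim))
    for layer in range(n_layers):
        cov = np.cov(id_hidden[:, layer, :], rowvar=False)
        inv_covs[layer] = np.linalg.pinv(cov + ridge * np.eye(dim))
    return means, inv_covs

def mahalanobis_trajectory(hidden, means, inv_covs):
    """Step 1: map each layer's hidden state to its Mahalanobis distance.

    hidden: array of shape (n_layers, dim) for a single sample.
    """
    diff = hidden - means
    squared = np.einsum("ld,lde,le->l", diff, inv_covs, diff)
    return np.sqrt(np.maximum(squared, 0.0))

def tv_score(hidden, means, inv_covs):
    """Step 2: average absolute difference between adjacent layers."""
    distances = mahalanobis_trajectory(hidden, means, inv_covs)
    return float(np.mean(np.abs(np.diff(distances))))

# Toy check with zero means and identity covariances:
# the per-layer distances are [5, 0, 10], so the score is (5 + 10) / 2.
toy_means = np.zeros((3, 2))
toy_inv_covs = np.stack([np.eye(2)] * 3)
toy_hidden = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
print(tv_score(toy_hidden, toy_means, toy_inv_covs))  # 7.5
```

In practice the Gaussians would be fitted on hidden states from the ID dataset, and each ID/OOD sample would then be scored against them.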
First, fine-tune the base model on the ID dataset (MultiArith).
cd your/project/root/folder/path/
bash Scripts/finetune.sh
The contents of Scripts/finetune.sh are shown below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"   # gpu id
model_name="llama2-7b"  # SFT model name
python FineTune/ID_finetune.py --model_name $model_name
After fine-tuning, checkpoints will be stored in os.environ['PROJECT_PATH']/Checkpoints/$model_name
Next, run inference on all ID/OOD datasets using the fine-tuned checkpoint.
cd your/project/root/folder/path/
bash Scripts/inference.sh
The contents of Scripts/inference.sh are shown below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1"  # gpu id
model_name="llama2-7b"  # SFT model name
max_output_token_num="16"
ckpt_step="10000"   # checkpoint step you selected
dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in ${dataset_list[*]}; do
    python Inference/ID_OOD_inference.py --model_name $model_name \
                                         --dataset "$i" \
                                         --category $category \
                                         --max_output_token_num $max_output_token_num \
                                         --ckpt_step $ckpt_step
done
dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in ${category_list[*]}; do
    python Inference/ID_OOD_inference.py --model_name $model_name \
                                         --dataset $dataset \
                                         --category "$i" \
                                         --max_output_token_num $max_output_token_num \
                                         --ckpt_step $ckpt_step
done
After inference, all inference results will be stored in os.environ['PROJECT_PATH']/Data/Inference_Data/$model_name. Each sample corresponds to one dictionary.
{  
  "id": i,
  "hidden_state": hidden_states,
  "output_scores": output_scores,
  "output_seq": output_seq
}
Finally, compute TV scores for each of the ID/OOD datasets.
cd your/project/root/folder/path/
bash Scripts/computation.sh
The contents of Scripts/computation.sh are shown below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
model_name="llama2-7b"
max_output_token_num="16"
max_order="5"   # Differential Smoothing Order
dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in ${dataset_list[*]}; do
    python Computation/ID_OOD_score.py --model_name $model_name \
                                       --dataset "$i" \
                                       --category $category \
                                       --max_output_token_num $max_output_token_num \
                                       --max_order $max_order
done
dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in ${category_list[*]}; do
    python Computation/ID_OOD_score.py --model_name $model_name \
                                       --dataset $dataset \
                                       --category "$i" \
                                       --max_output_token_num $max_output_token_num \
                                       --max_order $max_order
done
After computation, all scores will be stored in os.environ['PROJECT_PATH']/Data/Score_Data/$model_name.
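Once scores exist for an ID dataset and an OOD dataset, one common way to check how well they separate the two is AUROC. The sketch below is illustrative only: the `auroc` helper and the toy score lists are hypothetical, the on-disk format of the score files is not assumed, and this README does not state whether OOD samples receive higher or lower TV scores, so a value far from 0.5 in either direction indicates separation.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score.

    Ties count as 0.5. Uses all ID/OOD pairs, so it suits small to
    medium score lists rather than very large ones.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    greater = (ood_scores[:, None] > id_scores[None, :]).sum()
    ties = (ood_scores[:, None] == id_scores[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(id_scores) * len(ood_scores)))

# Toy, made-up scores purely to show the call; not real results.
toy_id_scores = [1.0, 2.0]
toy_ood_scores = [3.0, 4.0]
print(auroc(toy_id_scores, toy_ood_scores))  # 1.0
```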
If you find the methods and conclusions in our paper helpful, please consider citing our work. Thanks!
@article{wang2024trajectory,
  title={Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning},
  author={Wang, Yiming and Zhang, Pei and Yang, Baosong and Wong, Derek F and Zhang, Zhuosheng and Wang, Rui},
  journal={arXiv preprint arXiv:2405.14039},
  year={2024}
}



