This is the repository for the ACL 2025 Findings paper ["Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation"](https://aclanthology.org/2025.findings-acl.220/).
First, set up the environment:

```bash
# Install FlashRAG from GitHub
git clone https://github.com/RUC-NLPIR/FlashRAG.git
cd FlashRAG
pip install -e .[core]
pip install vllm==0.6.0
pip install FlagEmbedding==1.2.11
pip install evaluate

# Install all extra dependencies
pip install flashrag-dev[full]

# Install faiss
conda install -c pytorch -c nvidia faiss-gpu=1.8.0

pip install -U bitsandbytes
pip install pandarallel
```
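Optionally, you can verify that the core dependencies import cleanly. This is a minimal sanity-check sketch, assuming the pinned versions above; `check_env.py` is not part of the repo:

```python
# check_env.py -- quick sanity check that the core dependencies import cleanly.
import faiss           # installed via conda (faiss-gpu 1.8.0)
import flashrag        # installed via `pip install -e .[core]`
import FlagEmbedding   # pinned to 1.2.11 above
import vllm            # pinned to 0.6.0 above

print("faiss:", faiss.__version__)
print("vllm:", vllm.__version__)
print("GPUs visible to faiss:", faiss.get_num_gpus())
```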
Please first download the wiki-18 corpus from Hugging Face or ModelScope.
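One way to fetch it from Hugging Face is a `snapshot_download` sketch; the dataset repo id and file layout below are assumptions based on the FlashRAG project, so adjust them to wherever you actually obtain wiki-18:

```python
# download_corpus.py -- fetch the wiki-18 corpus from Hugging Face (sketch).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RUC-NLPIR/FlashRAG_datasets",  # assumed dataset repo
    repo_type="dataset",
    allow_patterns=["retrieval-corpus/*"],  # assumed location of the wiki-18 files
    local_dir="indexes/retrieval-corpus",
)
```

With the corpus in place, build the e5 index: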
```bash
CUDA_VISIBLE_DEVICES=2 python -m flashrag.retriever.index_builder \
    --retrieval_method e5 \
    --model_path xxx/.cache/huggingface/hub/models--intfloat--e5-base-v2/snapshots/1c644c92ad3ba1efdad3f1451a637716616a20e8/ \
    --corpus_path indexes/retrieval-corpus/wiki-18.jsonl \
    --save_dir indexes/ \
    --use_fp16 \
    --max_length 512 \
    --batch_size 512 \
    --pooling_method mean \
    --faiss_type Flat
```
The repository is organized as follows:

```
|-- utils/
|   |-- basic_config.yaml        # all FlashRAG parameters
|   |-- classes.py               # customized RAG pipelines
|-- dataset_generation.py        # generate fine-tuning datasets
|-- llm_generation.py            # generate rationales for datasets
|-- mmlu_evaluation.py           # evaluate the MMLU dataset across categories
|-- mmlu_preprocess.py           # preprocess the MMLU dataset
|-- run_finetune.sh              # fine-tune rerankers
|-- run_rag.py                   # run RAG pipelines
|-- score_cache_to_dataset.py    # integrate different scores into one dataset
```
Generate rationales for a dataset with an LLM served through an OpenAI-compatible API:

```bash
python llm_generation.py --api_key=xxx --base_url=xxx --dataset=nq --model=gpt-4o-mini
```
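Conceptually, rationale extraction asks the LLM which facts a document must supply to answer the query. Below is a minimal sketch of such a call; the prompt and helper are illustrative assumptions, not the actual implementation in `llm_generation.py`:

```python
# rationale_sketch.py -- illustrative rationale extraction via an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(api_key="xxx", base_url="xxx")  # same credentials as llm_generation.py

def extract_rationale(question: str) -> str:
    """Ask the LLM to state the facts needed to answer the question (assumed prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"List the key facts a document must contain to answer: {question}",
        }],
    )
    return response.choices[0].message.content

print(extract_rationale("Who wrote the novel Dune?"))
```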
Alternatively, our curated reasoning datasets can be downloaded from Hugging Face.
Next, generate score files on rationales and retrieval separately (a scoring sketch follows the commands):

```bash
python dataset_generation.py --dataset=xxx --phase=get_docs
python dataset_generation.py --dataset=xxx --phase=generate_dataset --method=reason
python dataset_generation.py --dataset=xxx --phase=generate_dataset --method=retrieval
```
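Conceptually, each retrieved document receives two scores: one against the raw query (retrieval relevance) and one against the extracted rationale. A minimal sketch with FlagEmbedding's reranker; the model name and pairing scheme are illustrative, and `dataset_generation.py` remains the authoritative implementation:

```python
# scoring_sketch.py -- score documents against the query and against the rationale.
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-base", use_fp16=True)  # illustrative model

query = "Who wrote the novel Dune?"
rationale = "The answer requires the author of the 1965 science-fiction novel Dune."
docs = [
    "Dune is a 1965 novel by Frank Herbert.",
    "Dune is a desert planet in a fictional universe.",
]

# One score list per signal; higher means more relevant.
retrieval_scores = reranker.compute_score([[query, d] for d in docs])
rationale_scores = reranker.compute_score([[rationale, d] for d in docs])
print(retrieval_scores, rationale_scores)
```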
Then integrate the rationale-based and retrieval scores and generate the fine-tuning dataset (a merging sketch follows):

```bash
python score_cache_to_dataset.py
```
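The integration step amounts to combining the two normalized score lists into a single ranking signal. A minimal sketch; the min-max normalization and the weight `alpha` are assumptions, and `score_cache_to_dataset.py` holds the actual logic:

```python
# combine_scores_sketch.py -- merge retrieval and rationale scores (illustrative).
def min_max(xs):
    """Scale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + 1e-9) for x in xs]

def combine(retrieval_scores, rationale_scores, alpha=0.5):
    """Weighted mix of the two signals; alpha is an assumed hyperparameter."""
    r = min_max(retrieval_scores)
    s = min_max(rationale_scores)
    return [alpha * si + (1 - alpha) * ri for ri, si in zip(r, s)]

print(combine([0.2, 0.9, 0.4], [1.3, 0.1, 0.8]))
```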
Fine-tune the reranker on the generated dataset (a simplified training sketch follows):

```bash
sh run_finetune.sh
```
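For intuition, fine-tuning aligns a cross-encoder reranker with the combined ranking through a pairwise objective. A heavily simplified sketch; the model, data format, and loss below are assumptions, and `run_finetune.sh` drives the real training:

```python
# finetune_sketch.py -- simplified pairwise cross-encoder fine-tuning (illustrative).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "BAAI/bge-reranker-base"  # illustrative reranker
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.train()
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One (query, positive doc, negative doc) triple from the generated dataset (assumed format).
query = "Who wrote the novel Dune?"
pos = "Dune is a 1965 novel by Frank Herbert."
neg = "Dune is a desert planet in a fictional universe."

batch = tok([query, query], [pos, neg], padding=True, truncation=True, return_tensors="pt")
scores = model(**batch).logits.squeeze(-1)   # [pos_score, neg_score]
loss = -torch.log_softmax(scores, dim=0)[0]  # push the positive above the negative
loss.backward()
optim.step()
optim.zero_grad()
print(float(loss))
```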
Run the RAG pipeline with the fine-tuned reranker:

```bash
python run_rag.py --rerank_model_path=xxx --dataset=nq
```
If you want to use generators served through the OpenAI API, you can try:

```bash
python run_rag.py --rerank_model_path=xxx --framework=openai --openai_model=gpt-4o-mini
```
If you find our resources useful, we would appreciate a citation:
```bibtex
@inproceedings{jia-etal-2025-bridging,
    title = "Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation",
    author = "Jia, Pengyue and
      Xu, Derong and
      Li, Xiaopeng and
      Du, Zhaocheng and
      Li, Xiangyang and
      Wang, Yichao and
      Wang, Yuhao and
      Liu, Qidong and
      Wang, Maolin and
      Guo, Huifeng and
      Tang, Ruiming and
      Zhao, Xiangyu",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.220/",
    pages = "4242--4256",
    ISBN = "979-8-89176-256-5",
    abstract = "The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of large language models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction."
}
```