Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
Simon Matrenok* (EPFL), Skander Moalla* (EPFL), Caglar Gulcehre (EPFL)
🚀 Coming soon! 🚀
- 🧠💻📊 All the scripts we used to train our models and produce the results presented in the paper (including reference data generation, training, and plotting).
- 🛠🏗️⚙️ All of our infrastructure and experiment-management code for running experiments at scale on a SLURM cluster and managing them (tracking failures, etc.).
- 📦🐍🔒 A reference implementation of a scalable code sandbox for SLURM clusters with container runtimes that does not require elevated privileges.
At the end of the day, it boils down to this:
```python
import torch


def qrpo_loss(beta, logps, ref_logps, rewards, ref_rewards):
    """Compute the QRPO loss for a batch of prompts.

    Args:
        beta (`torch.Tensor: (1,)`):
            The beta parameter for the QRPO loss.
        logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the model being trained.
        ref_logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the reference model.
        rewards (`torch.Tensor: (batch_size,)`):
            Rewards of the training completions.
        ref_rewards (`torch.Tensor: (batch_size, num_ref_rewards)`):
            Rewards of the reference completions generated by the reference model.

    Returns:
        loss (`torch.Tensor: (batch_size,)`): The per-sample QRPO loss.
    """
    # Log-ratio of the policy to the reference policy for each training completion.
    log_ratios = logps - ref_logps
    # Quantile reward: the empirical CDF of the reference rewards evaluated at each
    # training reward, i.e. the fraction of reference completions it matches or beats.
    quantile_rewards = (ref_rewards <= rewards.unsqueeze(dim=1)).float().mean(dim=1)
    # Log-partition function of the quantile reward; numerical simplification (Eq. 11).
    log_Z = torch.log(beta) + 1 / beta
    # Pointwise regression of the scaled log-ratio onto the centered quantile reward.
    loss = (quantile_rewards - beta * log_Z - beta * log_ratios) ** 2
    return loss
```
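
For a quick sanity check of the expected shapes, here is a toy call with arbitrary made-up numbers (2 prompts, 4 reference rewards each); in actual training, `logps` would be computed by the policy being optimized and carry gradients:

```python
import torch

# Toy inputs for shape illustration only (values are arbitrary, not from the paper).
beta = torch.tensor([0.1])
logps = torch.tensor([-42.0, -37.5])      # policy log-probs of the 2 training completions
ref_logps = torch.tensor([-40.0, -38.0])  # reference-model log-probs of the same completions
rewards = torch.tensor([1.3, -0.2])       # rewards of the training completions
ref_rewards = torch.randn(2, 4)           # 4 reference-completion rewards per prompt

per_sample_loss = qrpo_loss(beta, logps, ref_logps, rewards, ref_rewards)
print(per_sample_loss.shape)  # torch.Size([2]); reduce with .mean() before backprop
```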
```bibtex
@article{matrenok2025qrpo,
  title={Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions},
  author={Simon Matrenok and Skander Moalla and Caglar Gulcehre},
  year={2025},
  eprint={2507.08068},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.08068},
}
```