Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions
Simon Matrenok* (EPFL), Skander Moalla* (EPFL), Caglar Gulcehre (EPFL)
🚀 Coming soon! 🚀
- 🧠💻📊 All the scripts we used to train our models and produce the results presented in the paper (including reference data generation, training, and plotting).
- 🛠🏗️⚙️ All of our infrastructure and experiment-management code for running experiments at scale on a SLURM cluster and managing them (tracking failures, etc.).
- 📦🐍🔒 A reference implementation of a scalable code sandbox for SLURM clusters with container runtimes that does not require elevated privileges.
At the end of the day, it boils down to this:
```python
import torch


def qrpo_loss(beta, logps, ref_logps, rewards, ref_rewards):
    """Compute the QRPO loss for a batch of prompts.

    Args:
        beta (`torch.Tensor: (1,)`):
            The beta parameter for the QRPO loss.
        logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the model being trained.
        ref_logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the reference model.
        rewards (`torch.Tensor: (batch_size,)`):
            Rewards of the training completions.
        ref_rewards (`torch.Tensor: (batch_size, num_ref_rewards)`):
            Rewards of the reference completions generated by the reference model.

    Returns:
        loss (`torch.Tensor: (batch_size,)`): The per-sample QRPO loss.
    """
    # Log-ratio of the policy to the reference policy for each training completion.
    log_ratios = logps - ref_logps
    # Quantile reward: the empirical CDF of the reference rewards evaluated at each
    # training reward, i.e. the fraction of reference completions it matches or beats.
    quantile_rewards = (ref_rewards <= rewards.unsqueeze(dim=1)).float().mean(dim=1)
    # Log-partition function of the quantile reward; numerical simplification (Eq. 11).
    log_Z = torch.log(beta) + 1 / beta
    # Pointwise regression of the scaled log-ratio onto the centered quantile reward.
    loss = (quantile_rewards - beta * log_Z - beta * log_ratios) ** 2
    return loss
```
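
For a quick sanity check of the expected shapes, here is a toy call with arbitrary made-up numbers (2 prompts, 4 reference rewards each); in actual training, `logps` would be computed by the policy being optimized and carry gradients:

```python
import torch

# Toy inputs for shape illustration only (values are arbitrary, not from the paper).
beta = torch.tensor([0.1])
logps = torch.tensor([-42.0, -37.5])      # policy log-probs of the 2 training completions
ref_logps = torch.tensor([-40.0, -38.0])  # reference-model log-probs of the same completions
rewards = torch.tensor([1.3, -0.2])       # rewards of the training completions
ref_rewards = torch.randn(2, 4)           # 4 reference-completion rewards per prompt

per_sample_loss = qrpo_loss(beta, logps, ref_logps, rewards, ref_rewards)
print(per_sample_loss.shape)  # torch.Size([2]); reduce with .mean() before backprop
```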
```bibtex
@article{matrenok2025qrpo,
  title={Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions},
  author={Simon Matrenok and Skander Moalla and Caglar Gulcehre},
  year={2025},
  eprint={2507.08068},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.08068},
}
```