Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

(Figure: QRPO sketch)

Paper · Blog · License: MIT

Simon Matrenok* (EPFL), Skander Moalla* (EPFL), Caglar Gulcehre (EPFL)

🚀 Coming soon! 🚀

  • 🧠💻📊 All the scripts used to train models and produce the results presented in the paper (including reference data generation, training, and plotting).
  • 🛠🏗️⚙️ All of our infrastructure and experiment-management code for running experiments at scale on a SLURM cluster (tracking failures, etc.).
  • 📦🐍🔒 A reference implementation of a scalable code sandbox for SLURM clusters with container runtimes, which does not require elevated privileges.

At the end of the day, it boils down to this:

import torch


def qrpo_loss(beta, logps, ref_logps, rewards, ref_rewards):
    """Compute the QRPO loss for a batch of prompts.

    Args:
        beta (`torch.Tensor: (1,)`):
            The beta parameter for the QRPO loss.
        logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the model.
        ref_logps (`torch.Tensor: (batch_size,)`):
            Log probabilities of the training completions under the reference model.
        rewards (`torch.Tensor: (batch_size,)`):
            Rewards of the training completions.
        ref_rewards (`torch.Tensor: (batch_size, num_ref_rewards)`):
            Rewards of the reference completions generated by the reference model.

    Returns:
        loss (`torch.Tensor: (batch_size,)`): The computed QRPO loss.
    """
    log_ratios = logps - ref_logps
    # Empirical CDF: fraction of reference rewards each completion's reward matches or beats.
    quantile_rewards = (ref_rewards <= rewards.unsqueeze(dim=1)).float().mean(dim=1)
    log_Z = torch.log(beta) + 1 / beta      # numerical simplification (Eq. 11)
    loss = (quantile_rewards - beta * log_Z - beta * log_ratios) ** 2
    return loss
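The `quantile_rewards` line is the empirical CDF of the reference rewards evaluated at each training completion's reward: a reward that beats half of the reference completions maps to a quantile reward of 0.5. A minimal plain-Python sketch of that mapping for a single prompt (the name `quantile_reward` is illustrative, not from the codebase):

```python
def quantile_reward(reward, ref_rewards):
    """Fraction of reference rewards that this reward matches or beats (empirical CDF)."""
    return sum(r <= reward for r in ref_rewards) / len(ref_rewards)

# A reward of 2.0 beats 2 of the 4 reference rewards -> quantile reward 0.5.
print(quantile_reward(2.0, [1.0, 3.0, 1.5, 2.5]))  # 0.5
```

Because the quantile reward is bounded in [0, 1] regardless of the raw reward scale, the regression target in the loss above stays well-behaved across reward models.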

Citation

@article{matrenok2025qrpo,
    title={Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions},
    author={Simon Matrenok and Skander Moalla and Caglar Gulcehre},
    year={2025},
    eprint={2507.08068},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.08068},
}

About

Official codebase for "Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions" (Matrenok et al. 2025).
