Quantitative Reward Monitors for Reinforcement Learning

This repository provides the reproducible code for our AAAI 2026 paper, "Expressive Temporal Specifications for Reward Monitoring".

Designing informative reward functions is one of the main bottlenecks in reinforcement learning: sparse or poorly shaped rewards slow learning and can push agents towards undesirable behaviours.

This work uses quantitative Linear Temporal Logic on finite traces ($\mathrm{LTL}_f[\mathcal{F}]$) to define Quantitative Reward Monitors (QRMs), a class of finite-state machines with registers. Each monitor is synthesized from a temporal specification and, at every time step, returns a scalar "degree of satisfaction" of its formula rather than a single Boolean verdict. This provides dense, temporally aware feedback that guides agents more efficiently towards the specified behaviour.
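For intuition only (this is not the repository's API; all names below are invented), two single-operator monitors can be sketched as registers updated at every step: an "eventually" monitor keeps a running max of its subformula's value, while a "globally" monitor keeps a running min:

```python
from dataclasses import dataclass

@dataclass
class EventuallyMonitor:
    """Hypothetical sketch of a QRM for F(phi): the register stores the
    best degree of satisfaction of phi seen so far, so the output is a
    running max rather than a one-off Boolean verdict."""
    best: float = float("-inf")  # register

    def step(self, phi_value: float) -> float:
        self.best = max(self.best, phi_value)
        return self.best

@dataclass
class GloballyMonitor:
    """Dual sketch for G(phi): the register stores the worst value seen,
    so a single deep violation permanently lowers the emitted reward."""
    worst: float = float("inf")  # register

    def step(self, phi_value: float) -> float:
        self.worst = min(self.worst, phi_value)
        return self.worst
```

Fed the trace of values 0.1, 0.6, 0.3 for phi, the eventually-monitor emits 0.1, 0.6, 0.6: the agent receives graded progress signals instead of waiting for a terminal 0/1 outcome.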

Key advantages:

  • Expressive specifications: temporal logic over finite traces with quantitative operators, supporting reachability, safety, and non-Markovian objectives in a formal yet concise notation.
  • Automatic monitor synthesis: monitors are constructed in time linear in the size of the formula, with correctness and complexity guarantees.
  • Safety-aware monitoring: safety specifications can permanently clamp rewards after a violation, blocking positive returns from other monitors.
  • Algorithm-agnostic integration: monitors are composed with the environment as a product MDP, so any standard RL algorithm can be used on the extended environment (see the sketch below).
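A minimal sketch of this composition, assuming a Gymnasium-style environment and monitors like the ones above (the wrapper, its labeller, and the clamping rule are illustrative assumptions, not the repository's actual interface):

```python
import gymnasium as gym

class MonitoredEnv(gym.Wrapper):
    """Hypothetical product-MDP wrapper: the agent observes the joint
    (environment, monitor) state and is rewarded by the monitor, so any
    off-the-shelf RL algorithm can run on the wrapped environment."""

    def __init__(self, env, monitor, labeller, violation_threshold=0.0):
        super().__init__(env)
        self.monitor = monitor        # e.g. GloballyMonitor() from above
        self.labeller = labeller      # maps raw observations to phi's value
        self.violation_threshold = violation_threshold
        self.violated = False

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.violated = False  # a full version would also reset the monitor's registers
        return (obs, 0.0), info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)  # env reward discarded
        value = self.monitor.step(self.labeller(obs))
        reward = value
        # Safety clamping: once the monitor dips below the threshold,
        # positive returns are permanently blocked for the rest of the episode.
        if value < self.violation_threshold:
            self.violated = True
        if self.violated:
            reward = min(reward, 0.0)
        # Pairing the observation with the monitor's value keeps the product Markovian.
        return (obs, value), reward, terminated, truncated, info
```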

Setup instructions

  1. Clone with git: `git clone --recurse-submodules --shallow-submodules https://github.com/nightly/quantitative-reward-monitoring`
  2. Ensure Graphviz is installed on your system (used for exporting visualisations)
  3. Install the dependencies with `pip install -r requirements.txt` or, preferably, `uv sync` if you have Astral's uv installed

Code structure

quantitative-reward-monitoring/
├── experiments/               # Jupyter notebooks and experiment scripts
│   ├── visualise/            # Plotting utilities and visualisations
├── src/ 
│   ├── environments/         # Wrappers and helpers for different RL environments
│   │   ├── toy/              # Cliff Walking, Frozen Lake, Taxi gridworlds
│   │   ├── safety_gridworlds/# Conveyor Belt, Island Navigation, Sokoban tasks
│   │   ├── classic_control/  # Acrobot, Cartpole, Mountain Car, Pendulum
│   │   ├── box2d/            # Bipedal Walker, Lunar Lander
│   ├── reward_monitors/
│   │   ├── bool/             # Boolean monitor semantics
│   │   │   ├── syntax/       # Safe LTLf checking (syntactically)
│   │   │   └── …
│   │   ├── quantitative/     # Quantitative monitor semantics
│   │   │   ├── syntax/       # Safe LTLf[F] checking (syntactically)
│   │   │   └── …
│   │   └── …                 # Shared utilities for monitors
│   ├── rl/                  
│   │   ├── q_learning.py     # Tabular Q‑learning implementation
│   │   ├── ppo_continuous_action.py  # PPO for continuous actions
│   │   └── …                 # Other RL algorithms & helpers (including convergence checking)
│   └── core/                 # Core utilities
└── …
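The syntax/ directories above perform the syntactic safety checks. As a rough sketch of what such a check can look like (the tuple-based formula encoding here is an assumption, not the repository's representation): a formula in negation normal form is syntactically safe when its temporal operators are limited to Next, Globally and Release, i.e. it contains no Eventually or Until:

```python
SAFE_TEMPORAL = {"X", "G", "R"}   # Next, Globally, Release
UNSAFE_TEMPORAL = {"F", "U"}      # Eventually, Until (co-safe operators)

def is_syntactically_safe(formula) -> bool:
    """Check that an NNF formula, encoded as nested (op, *children)
    tuples, uses only safe temporal operators."""
    op, *children = formula
    if op == "atom":
        return True               # literals are safe
    if op in UNSAFE_TEMPORAL:
        return False              # F/U leave the safe fragment
    return all(is_syntactically_safe(c) for c in children)

# "Always stay in the safe zone" is safe; "eventually reach the goal" is not.
print(is_syntactically_safe(("G", ("atom", "safe_zone"))))  # True
print(is_syntactically_safe(("F", ("atom", "goal"))))       # False
```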

Cite

If you use the paper or code in your publications, please cite it with the following BibTeX entry:

@article{adalat2025expressive,
  title={Expressive Temporal Specifications for Reward Monitoring},
  author={Adalat, Omar and Belardinelli, Francesco},
  journal={arXiv preprint arXiv:2511.12808},
  year={2025}
}

Credits

Some of the original code & ideas were contributed by:

  • Jialue Xu
  • Alexander Philip Rader
