Prioritized experience replay #1622
Conversation
- Created SumTree (to be finalized)
- Started PrioritizedReplayBuffer: constructor and `sample` method (to be tested)
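For context, here is a minimal, self-contained sketch of proportional prioritized sampling as described in Schaul et al. (2015); all names and hyperparameter values are illustrative placeholders, not the PR's actual code:

```python
import numpy as np

# Conceptual sketch of proportional prioritized sampling (Schaul et al., 2015).
# Illustrative only: names, sizes and hyperparameters are placeholders.
rng = np.random.default_rng(0)
alpha, beta, batch_size = 0.6, 0.4, 32
td_errors = rng.normal(size=1_000)                # stand-in for stored TD errors

priorities = np.abs(td_errors) + 1e-6             # p_i = |delta_i| + eps
probs = priorities ** alpha
probs /= probs.sum()                              # P(i) = p_i^alpha / sum_k p_k^alpha

indices = rng.choice(len(probs), size=batch_size, p=probs)
weights = (len(probs) * probs[indices]) ** -beta  # importance-sampling correction
weights /= weights.max()                          # normalize so the largest weight is 1
```

The SumTree is what makes this sampling and the priority updates O(log N) per draw, instead of the O(N) normalization done naively above.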
@araffin could you (or anyone) please have a look at the two pytype errors? I don't quite understand how to fix them.
Thanks @araffin!
To be consistent with the rest of the buffers, and because PyTorch is not needed here (no GPU computation is required).
Hello @araffin,
I've added the list of Rainbow extensions, specifying which ones are currently implemented in the library.
Yes, probably, but the most important thing for now is to test the implementation (performance tests, checking that we can reproduce the results from the paper), document it, and add additional tests/docs (for the SumTree, for instance).
Just a comment: I've tested this implementation with QR-DQN and a VecEnv with multiple environments, but it fails because of the missing multi-env support. Good job starting the work on it, though! I hope it will be merged soon! 👍
I've just tried validating the implementation on blind cliffwalk and it seems much slower (~an order of magnitude) than the uniform replay buffer. The results below are for a single seed. Not sure why this is. The details of blind cliffwalk are a bit vague in the paper (and no code is available), but I've tried to implement it as close to the description as possible. Code for the test is in this gist:
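For context, this is one possible reading of the Blind Cliffwalk environment from the PER paper; the paper only gives a textual description, so the reward/termination details, the choice of "correct" action per state, and all names below are assumptions:

```python
class BlindCliffwalk:
    """Chain of n states with two actions. Only the 'correct' action advances the
    chain; the wrong one ends the episode with reward 0. Reaching the last state
    yields reward 1. Which action is 'correct' in each state is an assumption here."""

    def __init__(self, n_states: int = 16) -> None:
        self.n_states = n_states
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> tuple[int, float, bool]:
        correct_action = self.state % 2          # assumed convention: alternates per state
        if action != correct_action:
            return self.state, 0.0, True         # wrong action: terminate without reward
        self.state += 1
        done = self.state == self.n_states
        return self.state, float(done), done     # reward 1 only on the final transition
```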
weights = (self.size() * probs) ** -self.beta
weights = weights / weights.max()
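For context (not part of the diff): these two lines compute the importance-sampling weights from the PER paper, w_i = (N * P(i))^(-beta), and then normalize by the largest weight so the correction only ever scales updates downwards, which keeps training stable.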
# TODO: add proper support for multi env
How could we add proper support for multiple envs? Are there any ideas? Could the random line below work?
Not sure yet. The random line below might work, but we need to check first that it doesn't affect performance.
araffin/sbx@b5ce091 should be better, see araffin/sbx#50
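To make the multi-env question more concrete, here is one possible scheme, purely as an illustration (it is not what this PR or the linked sbx commit does): keep one priority per (buffer position, env) pair and sample flattened indices proportionally:

```python
import numpy as np

# Hypothetical sketch: one priority per (position, env) pair, sampled jointly.
rng = np.random.default_rng(0)
buffer_size, n_envs, batch_size = 1_000, 4, 32
priorities = rng.random((buffer_size, n_envs))     # placeholder priorities

probs = priorities.ravel() / priorities.sum()
flat_indices = rng.choice(buffer_size * n_envs, size=batch_size, p=probs)
pos, env_idx = np.unravel_index(flat_indices, (buffer_size, n_envs))
# `pos` indexes the buffer position, `env_idx` selects which parallel env's transition to use.
```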
Some update on my side: I just added CNN support for SBX (SB3 + Jax) DQN, and it is 10x faster than the PyTorch equivalent: araffin/sbx#49. That should allow us to test and debug things more quickly on Atari (~1h40 for 10M steps instead of 15h =D). Perf report: https://wandb.ai/openrlbenchmark/sbx?nw=nwuseraraffin (ongoing)
Some additional update: when trying to plug the PER implementation from this PR into the Jax DQN implementation, the experience replay was the bottleneck (by a good margin, making things 40x slower...), so I investigated different ways to speed things up. After playing with many different implementations (pure Python, NumPy, Jax, Jax jitted, ...), I decided to re-use the SB2 "SegmentTree" vectorized implementation and also to implement proper multi-env support. (Still debugging, but at least I've got the first sign of life, and this implementation is much faster.)
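For illustration, here is a small NumPy sketch of what a flat, batch-vectorized sum tree can look like; it is only meant to show where the vectorization speedup comes from, not to reproduce the SB2 SegmentTree or the sbx code:

```python
import numpy as np

class FlatSumTree:
    """Binary sum tree stored in a flat array: leaves hold priorities, node i is the
    sum of its children 2i and 2i+1. Retrieval descends the tree for a whole batch
    of query values at once, which is where the vectorization speedup comes from."""

    def __init__(self, capacity: int) -> None:
        self.capacity = 1
        while self.capacity < capacity:          # round up to a power of two
            self.capacity *= 2
        self.nodes = np.zeros(2 * self.capacity)

    def update(self, leaf_indices: np.ndarray, priorities: np.ndarray) -> None:
        self.nodes[leaf_indices + self.capacity] = priorities
        idx = np.unique((leaf_indices + self.capacity) // 2)
        while idx[0] >= 1:                       # recompute parents level by level
            self.nodes[idx] = self.nodes[2 * idx] + self.nodes[2 * idx + 1]
            idx = np.unique(idx // 2)

    def retrieve(self, values: np.ndarray) -> np.ndarray:
        """Map each value in [0, total) to the leaf whose prefix-sum interval contains it."""
        idx = np.ones(len(values), dtype=np.int64)       # all queries start at the root
        values = values.astype(np.float64)
        while idx[0] < self.capacity:                    # descend until every query hits a leaf
            left_sum = self.nodes[2 * idx]
            go_right = values >= left_sum
            values -= np.where(go_right, left_sum, 0.0)
            idx = 2 * idx + go_right
        return idx - self.capacity

    @property
    def total(self) -> float:
        return float(self.nodes[1])
```

Usage would be along the lines of `tree.update(indices, priorities)` after computing new TD errors, and `tree.retrieve(np.random.uniform(0, tree.total, batch_size))` to draw a prioritized batch in a single vectorized call.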
Hey @araffin, it is great to hear that. Does SBX/Jax alone bring that much of a speed improvement? If you think it is ready for testing, I can give it a try; just let me know when it is ready. :)
With the right parameters (see the exact command-line arguments for the RL Zoo in the OpenRL Benchmark organization runs on W&B), yes, around 10x faster.
The SBX version is ready to be tested, but so far I haven't managed to see any gain from the PER. I also experienced some explosion in the Q-function values when using multiple envs (so there is probably a bug there).
When I tested this PR I also noticed an explosion in the loss; at the time I thought it was due to my tweaking here and there. I also noticed that it doesn't give me any advantage over a normal buffer (and I used Double DQN, and even tried dueling). However, I tried adding an N-step buffer, which had a strong effect on learning. AFAIK, N-step (multi-step) returns are also part of Rainbow and account for a substantial part of its success. The key parts are the distributional, PER and N-step components, as far as I understand the concept. The others are somewhat task-specific and can even be detrimental.
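Since n-step returns come up here, a generic sketch of the n-step target (the function name is made up and this is not this PR's code):

```python
import numpy as np

def n_step_target(rewards: np.ndarray, dones: np.ndarray, bootstrap_value: float, gamma: float = 0.99) -> float:
    """Compute sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * V(s_{t+n}),
    truncating (no bootstrap) if a terminal state is reached within the n steps."""
    target, discount = 0.0, 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            return target                     # episode ended: do not bootstrap
    return target + discount * bootstrap_value
```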
Hi @araffin @AlexPasqua,
Instead of sampling transitions non-uniformly according to their priorities, we can keep uniform sampling and reweight each sample's loss by its priority, so that the expected gradient matches the prioritized one. This means we can avoid managing a sorted buffer and the associated complexity, while still converging to the same gradient. I've already implemented this approach. If you find it relevant, I'd be happy to open a PR.
Update: cf #2166
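For illustration, a minimal sketch of the loss-reweighting idea under my own assumptions (in particular the batch-mean normalization); it is not necessarily what #2166 implements:

```python
import torch

def priority_weighted_td_loss(td_errors: torch.Tensor, alpha: float = 0.6, eps: float = 1e-6) -> torch.Tensor:
    """Batch is sampled uniformly; each element's squared TD error is scaled by a
    detached weight proportional to |delta|^alpha, normalized by the batch mean
    (an approximation of the buffer-wide normalizer), so the expected gradient
    mimics prioritized sampling without maintaining a sum-tree buffer."""
    with torch.no_grad():
        priorities = (td_errors.abs() + eps) ** alpha
        weights = priorities / priorities.mean()
    return (weights * td_errors.pow(2)).mean()
```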
Yes, I currently don't have much time to work on this specific feature, so if someone could take it on, as you were doing in the previous commits and comments, that would be great. Of course, it's also possible to consider implementing only the prioritized approximation loss in #2166 instead.
Ok, nice.
Thanks for the PR, but for now I would prefer to have a full Rainbow implementation first. One of the main blockers currently is to check that the prioritized replay buffer is correctly implemented and to make it faster (I tried with Jax in araffin/sbx#50 but couldn't get good results so far, see #1622 (comment)).
Sure, I got you. Maybe we can add this equivalent loss under a completely different name? That way, we keep 'Prioritized Replay Buffer' reserved for the full Rainbow implementation, but we can still use this PAL method. It's a very useful feature that is really missing from SB3 right now: it introduces prioritization while keeping a reasonable training time. (And personally, I really need it for my research work, so I'm eager to have it in! haha)

Description
Implementation of a prioritized replay buffer for DQN.
Closes #1242
Motivation and Context
In accordance with #1242
Types of changes
Checklist
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass. (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.