Improve beacon processor queue metrics #8020
base: unstable
Conversation
Nice catch!
This didn't really work as expected, I'm going to continue debugging before merging.
It turns out the main issue with the graph is how the histogram samples were being displayed. We were doing a … The queue lengths became histograms in @dapplion's PR. I've updated the metrics charts to use a shorter 1m period, and the spikes are now visible. This is a node running …

[screenshot]

Strangely, this PR vs unstable doesn't seem to make much difference. Even when I forced an update of every queue after every event in commit b033e1f, the chart looks extremely similar:

[screenshot]

There are still a few challenges:
So maybe we just implement the simple fix, keep the dashboard updates I made, and be aware that these queue metrics are just difficult to interpret?
@michaelsproul I believe a heatmap plot would make more sense here. What do you want to visualize exactly? If we take the rate at 1m for a range of values, say >1 for a specific queue length, and then divide that by the count, we can visualize for that 1m range how many times the queue was non-empty when adding a message. A heatmap would give a better picture of what % of the time the queue was at 0, 10, 100, etc. within that 1-minute period.
I'll give the heatmap a go next week. I also like the idea of sampling only on inbound events (see the sketch below). That way we can say something vaguely meaningful about percentages of samples, e.g. "for 50% of inbound events the queue was empty". I find this more intuitive than sampling every time there is any kind of event, due to the bias on dequeueing I described. I was also thinking about why the graph doesn't go to 0 with the 0.99 quantile and 1m samples. I guess we are showing the maximum number of attestations in the queue in each 1m period, and it's not surprising that this is non-zero?
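As an illustration of the "sample only on inbound events" idea, here is a minimal hypothetical sketch (not Lighthouse's actual queue code) using the `prometheus` crate directly: the length is observed exactly once per arriving message, so histogram percentages read directly as "the fraction of inbound events that saw a queue of length N".

```rust
use std::collections::VecDeque;

use prometheus::{Histogram, HistogramOpts};

/// Hypothetical queue wrapper; Lighthouse's real work queues differ.
struct WorkQueue<T> {
    items: VecDeque<T>,
    /// Queue lengths, sampled on enqueue only.
    lengths: Histogram,
}

impl<T> WorkQueue<T> {
    fn new(name: &str) -> Self {
        let opts = HistogramOpts::new(name.to_string(), "queue length at message arrival");
        Self {
            items: VecDeque::new(),
            lengths: Histogram::with_opts(opts).unwrap(),
        }
    }

    fn push(&mut self, item: T) {
        // Sample *before* pushing: the length the new message observed.
        // "50% of samples are 0" then literally means the queue was empty
        // for 50% of inbound events.
        self.lengths.observe(self.items.len() as f64);
        self.items.push_back(item);
    }

    fn pop(&mut self) -> Option<T> {
        // No sample on dequeue, avoiding the dequeue bias described above.
        self.items.pop_front()
    }
}
```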
This reverts commit b033e1f.
Downside of only sampling on event arrival is that messages that occur infrequently (like exits, unknown head attestations, etc.) have their "latest" queue length get stuck at whatever value it was when the last message arrived. Idk, I kind of don't like any of the options here 🤣
nvm just read your comment above and looked at unstable, sounds like it does this already
Issue Addressed
I noticed that some of the beacon processor queue metrics remain at high levels, which doesn't really make sense if messages are being constantly processed. It turns out that some of the queue metrics could get stuck because of the way we do metrics for batched messages (attestations and aggregate attestations).
There also seems to be an issue involving `unknown_block_attestations`, which I am going to investigate separately.
Proposed Changes
When processing a batch of attestations/aggregates, update the queue length metric for the message type that was batched as well, as sketched below.
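A minimal sketch of the idea, using the `prometheus` crate directly with hypothetical type and metric names rather than Lighthouse's actual beacon processor internals: after draining a queue to form a batch, the per-message queue length metric is re-sampled, so it can drop back to zero instead of sticking at its pre-batch value.

```rust
use std::collections::VecDeque;
use std::sync::LazyLock;

use prometheus::{register_histogram_vec, HistogramVec};

struct Attestation; // placeholder for the real attestation type

// Hypothetical metric; Lighthouse's real metric names and registration differ.
static QUEUE_LENGTHS: LazyLock<HistogramVec> = LazyLock::new(|| {
    register_histogram_vec!(
        "beacon_processor_queue_length",
        "Sampled lengths of beacon processor work queues",
        &["work_type"]
    )
    .unwrap()
});

fn process_attestation_batch(queue: &mut VecDeque<Attestation>) {
    // Drain the whole queue and process it as a single batch.
    let batch: Vec<Attestation> = queue.drain(..).collect();

    // The batch is accounted under its own "batch" work type, which is why
    // the per-message metric could get stuck at a stale value. The fix is
    // to also sample the attestation queue's (now zero) length here.
    QUEUE_LENGTHS
        .with_label_values(&["gossip_attestation"])
        .observe(queue.len() as f64);

    for _attestation in batch {
        // ... verify and import each attestation ...
    }
}
```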
Additional Info
This complements (but should not conflict too much with):