
Conversation

michaelsproul
Member

@michaelsproul michaelsproul commented Sep 10, 2025

Issue Addressed

I noticed that some of the beacon processor queue metrics remain at high levels, which doesn't really make sense if messages are being constantly processed. It turns out that some of the queue metrics can get stuck because of the way we record metrics for batched messages (attestations and aggregate attestations).

[Image: queue_metrics]
  • yellow = gossip_attestation
  • light blue = unknown_block_attestation
  • light brown = voluntary exit (??)
  • orange (small) = unknown_block_aggregate

There also seems to be an issue involving unknown_block_attestations which I am going to investigate separately.

Proposed Changes

When processing a batch of attestations/aggregates, update the queue length for the message that was batched as well.
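
To make the proposed change concrete, here is a minimal standalone sketch (hypothetical names and a plain `prometheus` histogram, not Lighthouse's actual metrics plumbing): after draining a batch, the queue the batch came from gets its length re-sampled too, so its metric can fall back towards zero instead of staying frozen at the pre-batch value.

```rust
use prometheus::{register_histogram_vec, HistogramVec};
use std::collections::VecDeque;
use std::sync::LazyLock;

// Illustrative metric only; the real name, buckets and registry live in the beacon processor.
static QUEUE_LENGTH: LazyLock<HistogramVec> = LazyLock::new(|| {
    register_histogram_vec!(
        "example_processor_queue_length",
        "Sampled lengths of work queues",
        &["work_type"],
        vec![0.0, 1.0, 8.0, 64.0, 512.0, 4096.0]
    )
    .unwrap()
});

struct WorkQueue<T> {
    items: VecDeque<T>,
    work_type: &'static str,
}

impl<T> WorkQueue<T> {
    /// Drain up to `max` items as a single batch for processing.
    fn drain_batch(&mut self, max: usize) -> Vec<T> {
        let batch: Vec<T> = (0..max).filter_map(|_| self.items.pop_front()).collect();
        // The fix: also sample the new length of the *batched* queue here, rather than
        // only sampling queues that are touched by individual (unbatched) events.
        QUEUE_LENGTH
            .with_label_values(&[self.work_type])
            .observe(self.items.len() as f64);
        batch
    }
}
```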

Additional Info

This complements (but should not conflict too much with):

@michaelsproul michaelsproul added the bug, ready-for-review, low-hanging-fruit and UX-and-logs labels Sep 10, 2025

@eserilev eserilev left a comment

Nice catch!

@michaelsproul
Member Author

This didn't really work as expected; I'm going to continue debugging before merging.

@michaelsproul michaelsproul added the waiting-on-author label and removed the ready-for-review label Sep 11, 2025
@mergify mergify bot added the ready-for-review label and removed the waiting-on-author label Sep 11, 2025
@michaelsproul
Member Author

michaelsproul commented Sep 11, 2025

It turns out the main issue with the graph was how the histogram samples were being displayed. We were taking a rate over a 30m window, which combined with q=0.99 means we show (almost) the maximum value for that window, for the whole window.

The queue lengths became histograms in @dapplion's PR:

I've updated the metrics charts to use a shorter 1m period, and the spikes are now visible. This is a node running unstable with the updated chart:

[Image: attestation_queues]

Strangely, this PR doesn't seem to make much difference compared to unstable. Even when I forced an update of every queue after every event in this commit b033e1f, the chart looks extremely similar:

[Image: attestation_queues_forced]

There are still a few challenges:

  • It's hard to tell if there are messages lurking in queues: the chart rarely drops to 0, but I think this could be an artefact of the 1m period and the 0.99 quantile. Using a quantile of 0.5 does show lots of drops to 0, which is what we expect.
  • It's hard to measure the fraction of the time that the queue is empty, because it depends so heavily on how often samples are taken. If we sample every time an event affects the queue (what unstable does), this biases towards instances where the queue is non-empty, because when the queue is changing it is very often changing to a non-empty length. On the other hand, if we sample every queue length on every iteration (as I do in b033e1f), then we will see more 0 counts (see the toy sketch after this list). Neither approach seems particularly satisfying to me, and yet in the dashboards they look almost identical 🤣
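
The toy sketch mentioned above: a self-contained simulation (unrelated to the actual beacon processor) of the two sampling policies on a bursty queue. Event-triggered samples are dominated by the moments the queue is being filled or drained, while per-iteration samples mostly see an empty queue.

```rust
use std::collections::VecDeque;

fn main() {
    let mut queue: VecDeque<u32> = VecDeque::new();
    let mut on_event = Vec::new(); // sampled whenever the queue changes (unstable's approach)
    let mut per_iteration = Vec::new(); // sampled once per iteration (as in b033e1f)

    for tick in 0u32..10_000 {
        // A burst of 8 messages every 100 ticks; the worker drains one item per tick.
        if tick % 100 == 0 {
            for msg in 0..8 {
                queue.push_back(msg);
                on_event.push(queue.len()); // enqueue events: almost always non-empty
            }
        }
        if queue.pop_front().is_some() {
            on_event.push(queue.len()); // dequeue events: only fire while draining
        }
        per_iteration.push(queue.len()); // unconditional sample
    }

    let zero_fraction =
        |s: &[usize]| s.iter().filter(|&&len| len == 0).count() as f64 / s.len() as f64;
    println!("zero fraction (event-triggered): {:.2}", zero_fraction(&on_event));
    println!("zero fraction (per-iteration):   {:.2}", zero_fraction(&per_iteration));
}
```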

So maybe we just implement the simple fix, keep the dashboard updates I made, and be aware that these queue metrics are just difficult to interpret?

@jimmygchen jimmygchen added the waiting-on-author label and removed the ready-for-review label Sep 12, 2025
@dapplion
Collaborator

@michaelsproul I believe a heatmap plot would make more sense here. What do you want to visualize exactly? If we take the 1m rate for a range of values, say >1 for a specific queue length, and then divide by the count, we can visualize, for that 1m window, how often the queue was non-empty when adding a message. A heatmap would give a better picture, showing for each 1-minute period what % of the time the queue was at 0, 10, 100, etc.

@michaelsproul
Member Author

I'll give the heatmap a go next week.

I also like the idea of sampling only on inbound events. That way we can say something vaguely meaningful about percentages of samples. E.g. for 50% of inbound events the queue was empty. I find this more intuitive than sampling every time there is any kind of event, due to the issues with bias on dequeueing I described.

I was also thinking about why the graph doesn't go to 0 with the 0.99 quantile and 1m samples. I guess we are showing (almost) the maximum number of attestations in the queue in each 1m period, and it's not surprising that this is non-zero?

@mergify mergify bot added the ready-for-review label and removed the waiting-on-author label Sep 15, 2025
@michaelsproul michaelsproul changed the title Fix beacon processor metrics for batched messages Improve beacon processor queue metrics Sep 15, 2025
@michaelsproul
Member Author

The downside of only sampling on event arrival is that messages that occur infrequently (like exits, unknown head attestations, etc.) have their "latest" queue length stuck at whatever value it was when the last message arrived. Idk, I kind of don't like any of the options here 🤣

@eserilev
Member

eserilev commented Sep 16, 2025

maybe we could update the queue length metrics inside the push/pop fns in the LIFO/FIFO impls?

nvm just read your comment above and looked at unstable, sounds like it does this already
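
For reference, the pattern being suggested (which unstable apparently already follows in spirit) would look roughly like this; the types and the `record_queue_length` helper are hypothetical, not the actual FIFO/LIFO impls or metrics API:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for whatever metric call is appropriate
// (a gauge set or a histogram observation); not a real Lighthouse function.
fn record_queue_length(work_type: &str, len: usize) {
    let _ = (work_type, len);
}

struct FifoQueue<T> {
    items: VecDeque<T>,
    work_type: &'static str,
}

impl<T> FifoQueue<T> {
    fn push(&mut self, item: T) {
        self.items.push_back(item);
        // Sampling here means the metric is updated on every enqueue...
        record_queue_length(self.work_type, self.items.len());
    }

    fn pop(&mut self) -> Option<T> {
        let item = self.items.pop_front();
        // ...and on every dequeue, so batched processing can't leave it stale.
        record_queue_length(self.work_type, self.items.len());
        item
    }
}
```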

@michaelsproul michaelsproul added the beacon-processor label Sep 18, 2025