
Conversation

michaelsproul
Member

@michaelsproul michaelsproul commented Sep 10, 2025

Issue Addressed

I noticed that some of the beacon processor queue metrics remain at high levels, which doesn't really make sense if messages are being constantly processed. It turns out that some of the queue metrics can get stuck because of the way we record metrics for batched messages (attestations and aggregate attestations).

[Image: queue_metrics]
  • yellow = gossip_attestation
  • light blue = unknown_block_attestation
  • light brown = voluntary exit (??)
  • orange (small) = unknown_block_aggregate

There also seems to be an issue involving unknown_block_attestations which I am going to investigate separately.

Proposed Changes

When processing a batch of attestations/aggregates, update the queue length for the message that was batched as well.
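
To make the proposed change concrete, here is a minimal standalone sketch (hypothetical names and a plain `prometheus` histogram, not Lighthouse's actual metrics plumbing): after draining a batch, the queue the batch came from gets its length re-sampled too, so its metric can fall back towards zero instead of staying frozen at the pre-batch value.

```rust
use prometheus::{register_histogram_vec, HistogramVec};
use std::collections::VecDeque;
use std::sync::LazyLock;

// Illustrative metric only; the real name, buckets and registry live in the beacon processor.
static QUEUE_LENGTH: LazyLock<HistogramVec> = LazyLock::new(|| {
    register_histogram_vec!(
        "example_processor_queue_length",
        "Sampled lengths of work queues",
        &["work_type"],
        vec![0.0, 1.0, 8.0, 64.0, 512.0, 4096.0]
    )
    .unwrap()
});

struct WorkQueue<T> {
    items: VecDeque<T>,
    work_type: &'static str,
}

impl<T> WorkQueue<T> {
    /// Drain up to `max` items as a single batch for processing.
    fn drain_batch(&mut self, max: usize) -> Vec<T> {
        let batch: Vec<T> = (0..max).filter_map(|_| self.items.pop_front()).collect();
        // The fix: also sample the new length of the *batched* queue here, rather than
        // only sampling queues that are touched by individual (unbatched) events.
        QUEUE_LENGTH
            .with_label_values(&[self.work_type])
            .observe(self.items.len() as f64);
        batch
    }
}
```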

Additional Info

This complements (but should not conflict too much with):

@michaelsproul michaelsproul added the bug, ready-for-review, low-hanging-fruit and UX-and-logs labels Sep 10, 2025

@eserilev eserilev left a comment

Nice catch!

@michaelsproul
Member Author

This didn't really work as expected; I'm going to continue debugging before merging.

@michaelsproul michaelsproul added the waiting-on-author label and removed the ready-for-review label Sep 11, 2025
@mergify mergify bot added the ready-for-review label and removed the waiting-on-author label Sep 11, 2025
@michaelsproul
Member Author

michaelsproul commented Sep 11, 2025

It turns out the main issue with the graph was how the histogram samples were being displayed. We were taking a rate over a 30m window, which combined with q=0.99 means we show (almost) the maximum value for that window, for the whole window.

The queue lengths became histograms in @dapplion's PR:

I've updated the metrics charts to use a shorter 1m period, and the spikes are now visible. This is a node running unstable with the updated chart:

[Image: attestation_queues]

Strangely, this PR doesn't seem to make much difference compared to unstable. Even when I forced an update of every queue after every event in this commit b033e1f, the chart looks extremely similar:

[Image: attestation_queues_forced]

There are still a few challenges:

  • It's hard to tell if there are messages lurking in queues: the chart rarely drops to 0, but I think this could be an artefact of the 1m period and the 0.99 quantile. Using a quantile of 0.5 does show lots of drops to 0, which is what we expect.
  • It's hard to measure the fraction of the time that the queue is empty, because it depends so heavily on how often samples are taken. If we sample every time an event affects the queue (what unstable does), this biases towards instances where the queue is non-empty, because when the queue is changing it is very often changing to a non-empty length. On the other hand, if we sample every queue length on every iteration (as I do in b033e1f), then we will see more 0 counts (see the toy sketch after this list). Neither approach seems particularly satisfying to me, and yet in the dashboards they look almost identical 🤣
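
The toy sketch mentioned above: a self-contained simulation (unrelated to the actual beacon processor) of the two sampling policies on a bursty queue. Event-triggered samples are dominated by the moments the queue is being filled or drained, while per-iteration samples mostly see an empty queue.

```rust
use std::collections::VecDeque;

fn main() {
    let mut queue: VecDeque<u32> = VecDeque::new();
    let mut on_event = Vec::new(); // sampled whenever the queue changes (unstable's approach)
    let mut per_iteration = Vec::new(); // sampled once per iteration (as in b033e1f)

    for tick in 0u32..10_000 {
        // A burst of 8 messages every 100 ticks; the worker drains one item per tick.
        if tick % 100 == 0 {
            for msg in 0..8 {
                queue.push_back(msg);
                on_event.push(queue.len()); // enqueue events: almost always non-empty
            }
        }
        if queue.pop_front().is_some() {
            on_event.push(queue.len()); // dequeue events: only fire while draining
        }
        per_iteration.push(queue.len()); // unconditional sample
    }

    let zero_fraction =
        |s: &[usize]| s.iter().filter(|&&len| len == 0).count() as f64 / s.len() as f64;
    println!("zero fraction (event-triggered): {:.2}", zero_fraction(&on_event));
    println!("zero fraction (per-iteration):   {:.2}", zero_fraction(&per_iteration));
}
```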

So maybe we just implement the simple fix, keep the dashboard updates I made, and be aware that these queue metrics are just difficult to interpret?

@jimmygchen jimmygchen added the waiting-on-author label and removed the ready-for-review label Sep 12, 2025
@dapplion
Collaborator

@michaelsproul I believe a heatmap plot would make more sense here. What do you want to visualize exactly? If we take the 1m rate for a range of values, say >1 for a specific queue length, and then divide by the count, we can visualize, for that 1m window, how often the queue was non-empty when adding a message. A heatmap would give a better picture, showing for each 1-minute period what % of the time the queue was at 0, 10, 100, etc.

@michaelsproul
Member Author

I'll give the heatmap a go next week.

I also like the idea of sampling only on inbound events. That way we can say something vaguely meaningful about percentages of samples. E.g. for 50% of inbound events the queue was empty. I find this more intuitive than sampling every time there is any kind of event, due to the issues with bias on dequeueing I described.

I was also thinking about why the graph doesn't go to 0 with the 0.99 quantile and 1m samples. I guess we are showing (almost) the maximum number of attestations in the queue in each 1m period, and it's not surprising that this is non-zero?

@mergify mergify bot added the ready-for-review label and removed the waiting-on-author label Sep 15, 2025
@michaelsproul michaelsproul changed the title Fix beacon processor metrics for batched messages Improve beacon processor queue metrics Sep 15, 2025
@michaelsproul
Member Author

The downside of only sampling on event arrival is that messages that occur infrequently (like exits, unknown head attestations, etc.) have their "latest" queue length stuck at whatever value it was when the last message arrived. Idk, I kind of don't like any of the options here 🤣

@eserilev
Member

eserilev commented Sep 16, 2025

maybe we could update the queue length metrics inside the push/pop fns in the LIFO/FIFO impls?

nvm just read your comment above and looked at unstable, sounds like it does this already
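
For reference, the pattern being suggested (which unstable apparently already follows in spirit) would look roughly like this; the types and the `record_queue_length` helper are hypothetical, not the actual FIFO/LIFO impls or metrics API:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for whatever metric call is appropriate
// (a gauge set or a histogram observation); not a real Lighthouse function.
fn record_queue_length(work_type: &str, len: usize) {
    let _ = (work_type, len);
}

struct FifoQueue<T> {
    items: VecDeque<T>,
    work_type: &'static str,
}

impl<T> FifoQueue<T> {
    fn push(&mut self, item: T) {
        self.items.push_back(item);
        // Sampling here means the metric is updated on every enqueue...
        record_queue_length(self.work_type, self.items.len());
    }

    fn pop(&mut self) -> Option<T> {
        let item = self.items.pop_front();
        // ...and on every dequeue, so batched processing can't leave it stale.
        record_queue_length(self.work_type, self.items.len());
        item
    }
}
```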

@michaelsproul michaelsproul added the beacon-processor label Sep 18, 2025