
Ingester can get stuck with growing block queue #18

@scottyeager

Description


There's a rare issue where the ingester keeps queuing up new blocks to process but never processes them. While this is happening, worker processes occasionally die and get respawned.

Here's a sample of logs:

2024-12-31 18:58:54.309482 processed 0 blocks in 30 seconds 10777 blocks queued 5 processes alive 0 write jobs
2024-12-31 18:59:24.876481 processed 0 blocks in 30 seconds 10782 blocks queued 5 processes alive 0 write jobs
2024-12-31 18:59:55.710535 processed 0 blocks in 30 seconds 10787 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:00:26.197006 processed 0 blocks in 30 seconds 10792 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:00:56.663733 processed 0 blocks in 30 seconds 10797 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:01:27.195469 processed 0 blocks in 30 seconds 10802 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:01:57.681833 processed 0 blocks in 30 seconds 10807 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:02:42.787301 processed 0 blocks in 30 seconds 10815 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:03:13.323732 processed 0 blocks in 30 seconds 10820 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:03:43.789056 processed 0 blocks in 30 seconds 10825 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:04:14.355872 processed 0 blocks in 30 seconds 10830 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:04:44.812938 processed 0 blocks in 30 seconds 10835 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:05:15.311260 processed 0 blocks in 30 seconds 10840 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:05:45.792779 processed 0 blocks in 30 seconds 10845 blocks queued 5 processes alive 0 write jobs

This is easy enough to catch via monitoring and address manually with a restart, but I wonder if some simple logic could achieve the same thing automatically. For example, if the number of blocks processed over a certain time period (say, five minutes) falls below some threshold, abort and let the process manager restart the ingester; see the sketch below. Checking that we actually have connectivity to tfchain might be a nice touch, but it's not a huge deal to restart every five minutes should network connectivity be lost.
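Not sure what the cleanest way to wire this in is, but here's a rough Python sketch of the kind of watchdog I have in mind. The `get_processed_count` callable and the `processed_total` counter are placeholders for whatever counter the ingester already tracks, and the timeout is just the five minute window mentioned above.

```python
import os
import threading
import time

STALL_TIMEOUT = 300   # five minutes without progress before giving up
CHECK_INTERVAL = 30   # matches the existing 30 second reporting interval


def watchdog(get_processed_count):
    """Kill the process if the processed-block count stops increasing.

    get_processed_count is any callable returning the cumulative number of
    processed blocks; the process manager is expected to restart the ingester.
    """
    last_count = get_processed_count()
    last_progress = time.monotonic()
    while True:
        time.sleep(CHECK_INTERVAL)
        count = get_processed_count()
        if count > last_count:
            last_count = count
            last_progress = time.monotonic()
        elif time.monotonic() - last_progress >= STALL_TIMEOUT:
            print("No blocks processed in five minutes, exiting for restart")
            # os._exit so the whole process dies even from a daemon thread
            os._exit(1)


# Example wiring: run it beside the main loop, reading a shared counter.
# threading.Thread(target=watchdog, args=(lambda: processed_total,), daemon=True).start()
```

A connectivity check against tfchain could be added before exiting, but even without it the worst case is a restart loop while the network is down.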
