Skip to content

CN Needs an Exponential Backoff applied to BN in all failure cases #21214

@derektriley

Description

@derektriley

Background

In a scenario where 1 CN is streaming to 1 BN (only available), the BN starts to slowly fall behind and eventually the block being streamed is no longer in the block buffer (because writerMode = FILE_AND_GRPC and streamMode = BOTH).

The CN then gets into a repeating pattern of ending the stream, trying to stream the current block, receiving behind, sending too far behind, restarting stream with 0ms delay.

This leads to excessive log spam (will be fixed in separate issue with log statements at DEBUG level) and excessive exception throwing as streaming threads attempt to stream while stream is being closed.

For example:

2025-09-23 16:15:41.946 DEBUG 731  BlockNodeConnectionManager - [10.68.30.112:40840/ACTIVE] Processing block 249 (isBlockProofSent=false, totalBlockRequests=292, currentRequestIndex=221)
2025-09-23 16:15:41.952 DEBUG 731  BlockNodeConnectionManager - [10.68.30.112:40840/ACTIVE] Processing block 249 (isBlockProofSent=false, totalBlockRequests=292, currentRequestIndex=222)
2025-09-23 16:15:41.957 DEBUG 676  BlockBufferService - Block buffer status: idealMaxBufferSize=150, blocksChecked=132, blocksPruned=1, blocksPendingAck=130, blockRange=[250, 379], saturation=86.6667%
2025-09-23 16:15:41.958 DEBUG 708  BlockNodeConnectionManager - [10.68.30.112:40840/ACTIVE] Block 249 not found in buffer (latestBlock=380); connection will be closed
2025-09-23 16:15:41.958 DEBUG 503  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Closing connection...
2025-09-23 16:15:41.958 DEBUG 517  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Closing request pipeline for block node
2025-09-23 16:15:41.958 DEBUG 649  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Stream completed (stream close was in progress)
2025-09-23 16:15:41.958 DEBUG 523  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Request pipeline successfully closed
2025-09-23 16:15:41.958 DEBUG 172  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Connection state transitioned from ACTIVE to CLOSED
2025-09-23 16:15:41.958 DEBUG 210  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Cancelled periodic stream reset
2025-09-23 16:15:41.958 DEBUG 566  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Jumping to block -1
2025-09-23 16:15:41.958 DEBUG 793  BlockNodeConnectionManager - Marking request to jump to block -1
2025-09-23 16:15:41.958 DEBUG 509  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Connection successfully closed
2025-09-23 16:15:41.958 WARN  323  BlockNodeConnectionManager - [10.68.30.112:40840/CLOSED] Closing and rescheduling connection for reconnect attempt
2025-09-23 16:15:41.959 DEBUG 426  BlockNodeConnectionManager - [10.68.30.112:40840/CLOSED] Scheduling reconnection for node in 0 ms
2025-09-23 16:15:41.965 DEBUG 440  BlockNodeConnectionManager - [10.68.30.112:40840/CLOSED] Successfully scheduled reconnection task
2025-09-23 16:15:41.965 DEBUG 837  BlockNodeConnectionManager - [10.68.30.112:40840/UNINITIALIZED] Running connection task...
2025-09-23 16:15:41.967 DEBUG 172  BlockNodeConnection - [10.68.30.112:40840/PENDING] Connection state transitioned from UNINITIALIZED to PENDING
2025-09-23 16:15:41.967 DEBUG 172  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Connection state transitioned from PENDING to ACTIVE
2025-09-23 16:15:41.967 DEBUG 195  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Scheduled periodic stream reset every PT24H
2025-09-23 16:15:41.975 DEBUG 769  BlockNodeConnectionManager - Jumping to block 380
2025-09-23 16:15:42.004 DEBUG 644  BlockNodeConnection - [10.68.30.112:40840/CLOSED] onComplete invoked but connection is already closed
2025-09-23 16:15:42.102 DEBUG 328  FileBlockItemWriter - Started new block in FileBlockItemWriter 381
2025-09-23 16:15:42.102 DEBUG 51   GrpcBlockItemWriter - Started new block in GrpcBlockItemWriter 381
2025-09-23 16:15:42.108 DEBUG 383  FileBlockItemWriter - Closed block in FileBlockItemWriter 380
2025-09-23 16:15:42.108 DEBUG 225  BlockState - [Block 380] closed at 2025-09-23T16:15:42.108497011Z
2025-09-23 16:15:42.108 DEBUG 85   GrpcBlockItemWriter - Closed block in GrpcBlockItemWriter 380
2025-09-23 16:15:42.208 DEBUG 273  BlockBufferIO - Block 317 (items: 76866) written to file: /opt/hgcapp/blockStreams/buffer/1758644114033/block-317.bin (bytes: 83195689)
2025-09-23 16:15:42.302 DEBUG 731  BlockNodeConnectionManager - [10.68.30.112:40840/ACTIVE] Processing block 380 (isBlockProofSent=false, totalBlockRequests=305, currentRequestIndex=0)
2025-09-23 16:15:42.316 DEBUG 731  BlockNodeConnectionManager - [10.68.30.112:40840/ACTIVE] Processing block 380 (isBlockProofSent=false, totalBlockRequests=305, currentRequestIndex=1)
2025-09-23 16:15:42.323 DEBUG 326  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Received EndOfStream response (block=248, responseCode=BEHIND)
2025-09-23 16:15:42.324 DEBUG 393  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Block node is behind and block state is not available.
2025-09-23 16:15:42.326 DEBUG 503  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Closing connection...
2025-09-23 16:15:42.326 DEBUG 517  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Closing request pipeline for block node
2025-09-23 16:15:42.327 DEBUG 649  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Stream completed (stream close was in progress)
2025-09-23 16:15:42.327 DEBUG 523  BlockNodeConnection - [10.68.30.112:40840/ACTIVE] Request pipeline successfully closed
2025-09-23 16:15:42.327 DEBUG 172  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Connection state transitioned from ACTIVE to CLOSED
2025-09-23 16:15:42.327 DEBUG 210  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Cancelled periodic stream reset
2025-09-23 16:15:42.327 DEBUG 566  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Jumping to block -1
2025-09-23 16:15:42.327 DEBUG 793  BlockNodeConnectionManager - Marking request to jump to block -1
2025-09-23 16:15:42.327 DEBUG 509  BlockNodeConnection - [10.68.30.112:40840/CLOSED] Connection successfully closed
2025-09-23 16:15:42.327 WARN  323  BlockNodeConnectionManager - [10.68.30.112:40840/CLOSED] Closing and rescheduling connection for reconnect attempt
2025-09-23 16:15:42.327 DEBUG 426  BlockNodeConnectionManager - [10.68.30.112:40840/CLOSED] Scheduling reconnection for node in 0 ms
2025-09-23 16:15:42.327 DEBUG 682  BlockNodeConnectionManager - Exception caught in block stream worker loop Attempting to send data frame in invalid state: HALF_CLOSED_LOCAL

Acceptance Criteria

  1. Introduce exp backoff for the above cases and ensure all cases have backoff.
  2. Add HAPI tests to Block Node XTS suite exercising this behavior.
  3. Any additional Unit Tests

Dependencies

No response

Definition of Ready (DoR) Checklist

  • Clear acceptance criteria
  • Clear and detailed description
  • Dependencies identified
  • Links to documentation
  • Should be completable in 2-3 Days
  • Initial draft of Low-level design document
  • At least high level test plan
  • Groomed/Estimated

Definition of Done (DoD) Checklist

  • Acceptance Criteria complete
  • No Codacy issues greater than minor (in new code)
  • JavaDocs updated/created
  • Code commented
  • Unit tests created/updated
  • 80% test code coverage (in new code)
  • Happy Path and major negative cases in HAPI tests as applicable

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions