@AnnaR-prog
Contributor

…nfiguration from hot path

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@github-actions
github-actions bot commented Dec 8, 2025

Test Results

    6 files   -     1     60 suites   - 71   31m 25s ⏱️ - 2m 32s
1 844 tests  - 1 304  1 842 ✅  - 1 305  1 💤 ±0  1 ❌ +1 
1 924 runs   - 1 304  1 922 ✅  - 1 305  1 💤 ±0  1 ❌ +1 

For more details on these failures, see this check.

Results for commit 0fd96d8. ± Comparison against base commit 4ee4fa7.

This pull request removes 1331 and adds 27 tests. Note that renamed tests count towards both.
github.com/lavanet/lava/v5/protocol/rpcprovider ‑ TestResourceLimiter_SelectBucket_CUPriority/Batch_method_with_ampersands_uses_normal_bucket
github.com/lavanet/lava/v5/protocol/rpcprovider ‑ TestResourceLimiter_SelectBucket_CUPriority/Batch_method_with_high_CU_still_uses_normal_bucket
github.com/lavanet/lava/v5/protocol/rpcprovider ‑ TestResourceLimiter_SelectBucket_CUPriority/Single_ampersand_in_method_name
github.com/lavanet/lava/v5/x/conflict ‑ TestGenesis
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestListConflictVote
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestListConflictVote/ByKey
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestListConflictVote/ByOffset
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestListConflictVote/Total
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestShowConflictVote
github.com/lavanet/lava/v5/x/conflict/client/cli ‑ TestShowConflictVote/found
…
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize/Quorum_Disabled_-_Returns_Min
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize/Quorum_Disabled_-_Zero_Responses
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize/Quorum_Enabled_-_High_Response_Count
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize/Quorum_Enabled_-_Multiple_Responses
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestGetRequiredQuorumSize/Quorum_Enabled_-_Single_Response
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestNodeErrorPrioritizedOverProtocolErrors
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestNodeErrorPrioritizedOverProtocolErrors/node_error_prioritized_over_protocol_errors
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestNodeErrorPrioritizedOverProtocolErrors/node_error_prioritized_over_protocol_errors_with_quorum_selection
github.com/lavanet/lava/v5/protocol/relaycore ‑ TestNodeErrorQuorumMet
…

♻️ This comment has been updated with latest results.

@pull-request-size pull-request-size bot added size/M and removed size/L labels Dec 9, 2025

The heartbeat goroutine was accidentally removed. Adding it back to:
- Show that the test is still alive and progressing (prints every second)
- Distinguish a frozen process from one that is just waiting on channels

Also added debug prints for:
- Waiting for virtual epoch signals (1/3, 2/3, 3/3)
- REST relay test progress (1/10 through 10/10)

This will help identify if the hang after epoch 1 is due to:
- Epoch signal not being sent
- Channel blocking
- Process being killed externally
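
A minimal sketch of the heartbeat idea described above, assuming a ticker-driven goroutine. The names `startHeartbeat`, `beat`, and `stop` are hypothetical and not taken from the PR; the real test code may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// startHeartbeat runs beat once per interval on its own goroutine until the
// returned stop function is called. stop blocks until the goroutine exits.
// (Hypothetical helper, not the PR's actual code.)
func startHeartbeat(interval time.Duration, beat func()) (stop func()) {
	done := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				beat()
			case <-done:
				return
			}
		}
	}()
	return func() {
		close(done)
		<-exited
	}
}

func main() {
	start := time.Now()
	stop := startHeartbeat(time.Second, func() {
		fmt.Printf("heartbeat: alive after %s\n", time.Since(start).Round(time.Second))
	})
	defer stop()

	// Stands in for the test body that waits on epoch channels.
	time.Sleep(2500 * time.Millisecond)
}
```

If the heartbeat keeps printing while the test makes no progress, the test is waiting on a channel. If the heartbeat stops as well, the whole process is frozen or has been killed.
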
@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 9, 2025

Added detailed logging to diagnose why the epoch 2 signal might not be sent:
- Initial epoch counter and duration on startup
- Sleep duration and target time before each epoch wait
- Confirmation when epoch ends and signal is sent
- Whether signal was received or ignored (no receiver)
- Context cancellation detection
- Panic detection with explicit stdout print

This will show if:
- Sleep duration becomes negative or very large
- Signal is sent but not received
- Goroutine panics or is cancelled
- Goroutine stops running entirely
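
The diagnostics listed above could look roughly like the sketch below. It assumes the epoch goroutine sleeps until a computed target time and then attempts a non-blocking send. `runEpochs` and `trySignal` are hypothetical names, and the log wording is illustrative only.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// trySignal attempts a non-blocking send and reports whether a receiver
// (or a free buffer slot) was available to take the signal.
func trySignal(ch chan<- bool) bool {
	select {
	case ch <- true:
		return true
	default:
		return false
	}
}

// runEpochs sleeps until each epoch boundary and then signals ch, logging
// each step so that a missing signal can be traced to its cause.
func runEpochs(ctx context.Context, start time.Time, epoch time.Duration, ch chan<- bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("epoch goroutine panicked: %v\n", r)
		}
	}()
	fmt.Printf("epoch goroutine started: counter=0 duration=%s\n", epoch)
	for counter := 1; ; counter++ {
		target := start.Add(time.Duration(counter) * epoch)
		sleep := time.Until(target)
		fmt.Printf("epoch %d: sleeping %s until %s\n", counter, sleep, target.Format(time.RFC3339Nano))
		if sleep < 0 {
			fmt.Printf("epoch %d: WARNING negative sleep, target already passed\n", counter)
		}
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			fmt.Printf("epoch %d: context cancelled: %v\n", counter, ctx.Err())
			return
		}
		if trySignal(ch) {
			fmt.Printf("epoch %d: ended, signal received\n", counter)
		} else {
			fmt.Printf("epoch %d: ended, signal ignored (no receiver)\n", counter)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()

	ch := make(chan bool)
	go func() {
		for range ch {
		}
	}()
	runEpochs(ctx, time.Now(), 100*time.Millisecond, ch)
}
```
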
Added time.Sleep(100ms) before all critical debug prints in:
- Epoch goroutine (before every print statement)
- Main goroutine receive loop (before every print statement)

The sleeps are meant to give earlier output time to reach the CI log
before the next print statement, so we can see exactly where execution
stops if print buffering or timing is involved.

Combined with os.Stdout.Sync() after prints, this gives us maximum
visibility into what's happening.
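
A small sketch of a print-then-sync debug helper, assuming numbered lines are acceptable in the log. `formatDebug` and `debugPrint` are hypothetical names. Go's `fmt` functions write to `os.Stdout` without a userspace buffer, and `Sync` may return an error when stdout is a pipe or terminal, so the sketch ignores that error.

```go
package main

import (
	"fmt"
	"os"
)

// formatDebug builds one numbered debug line so the last line that reached
// the log identifies the last step that ran. (Hypothetical helper.)
func formatDebug(seq int, msg string) string {
	return fmt.Sprintf("[debug %03d] %s\n", seq, msg)
}

// debugPrint writes the line to stdout and then asks the OS to sync the
// file descriptor. The Sync error is deliberately ignored.
func debugPrint(seq int, msg string) {
	fmt.Fprint(os.Stdout, formatDebug(seq, msg))
	_ = os.Stdout.Sync()
}

func main() {
	debugPrint(1, "about to send signal for epoch 2")
	debugPrint(2, "entering select...")
}
```
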
The hang is occurring BEFORE the select statement, likely at the
time.Sleep(100ms) call after 'about to send signal'.

Removed that specific sleep to see if it's what's causing the hang.
Kept the goroutine dump for epoch 2 to understand the state.

The sequence is:
✅ about to send signal for epoch 2 (printed)
❌ entering select... (never printed)

This suggests the issue is in one of these operations:
- epochCounter++ (unlikely)
- time.Sleep(100ms) (LIKELY - timer/scheduler issue)
- os.Stdout.Sync() (possible - output pipe blocked)
- fmt.Printf (possible - formatting issue)
CRITICAL BUG IDENTIFIED:
The select statement itself freezes the ENTIRE Go runtime:
- Epoch goroutine prints 'entering select to send signal 2'
- ENTIRE PROCESS FREEZES (no heartbeat, no goroutines run)
- This happened BEFORE any goroutine dump was added
- Even independent heartbeat goroutine stops completely

This suggests that this select configuration triggers a Go runtime bug
or a deadlock that blocks the scheduler itself.

THE FIX:
- Removed select statement entirely
- Use simple direct channel send: signalChannel <- true
- This will block until main receives (which is fine)
- Test will be killed via defer epochCancel() when done
- Avoids whatever pathological interaction was freezing runtime

This is the simplest possible approach - no select, no cases, just send.
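
The direct-send approach can be sketched as follows. This is an illustration under assumptions: the PR sends `true` on `signalChannel`, while the sketch sends the epoch number so the order is visible, and `collectSignals` is a hypothetical name.

```go
package main

import "fmt"

// collectSignals starts a sender goroutine that uses a plain blocking send
// for each epoch, then returns the epochs in the order the receiver saw them.
func collectSignals(epochs int) []int {
	signalChannel := make(chan int) // unbuffered: each send waits for the receiver
	go func() {
		for epoch := 1; epoch <= epochs; epoch++ {
			signalChannel <- epoch // no select, no default: blocks until received
		}
	}()

	received := make([]int, 0, epochs)
	for i := 0; i < epochs; i++ {
		received = append(received, <-signalChannel)
	}
	return received
}

func main() {
	fmt.Println(collectSignals(3)) // prints [1 2 3]
}
```

One trade-off of a bare send: cancelling a context does not interrupt it, so if the receiver stops reading, the sender goroutine stays blocked until the process exits.
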
Added comprehensive debug logging to the cleanup phase, with 100ms sleeps after each print:
- Entry point
- Lock acquisition/release
- Number of commands copied
- Each command being killed
- Completion

The sleeps are meant to help output flush in the CI environment and to
help identify the exact point of failure if the cleanup phase has issues.
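
The logging points above map onto a cleanup routine shaped roughly like this sketch. It assumes the commands are copied under a lock and killed after the lock is released. `processRegistry` and its fields are hypothetical stand-ins for the test's actual command table.

```go
package main

import (
	"fmt"
	"sync"
)

// processRegistry is a hypothetical stand-in for the test's command table.
type processRegistry struct {
	mu    sync.Mutex
	kills map[string]func() error // name -> function that kills the process
}

// cleanup snapshots the registry under the lock, releases the lock, and only
// then kills each entry, logging every step. It returns the number killed.
func (r *processRegistry) cleanup(logf func(format string, args ...interface{})) int {
	logf("cleanup: entered")
	r.mu.Lock()
	logf("cleanup: lock acquired")
	snapshot := make(map[string]func() error, len(r.kills))
	for name, kill := range r.kills {
		snapshot[name] = kill
	}
	r.kills = map[string]func() error{}
	r.mu.Unlock()
	logf("cleanup: lock released, %d commands copied", len(snapshot))

	killed := 0
	for name, kill := range snapshot {
		logf("cleanup: killing %s", name)
		if err := kill(); err != nil {
			logf("cleanup: killing %s failed: %v", name, err)
			continue
		}
		killed++
	}
	logf("cleanup: done, %d killed", killed)
	return killed
}

func main() {
	r := &processRegistry{kills: map[string]func() error{
		"10_StartLavaInEmergencyMode": func() error { return nil },
	}}
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}
	r.cleanup(logf)
}
```

Killing outside the lock keeps a slow or hanging kill from blocking other goroutines that need the registry.
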
The direct send (without select) FIXED the epoch signal issue: the test progressed!
But it now hangs while killing 10_StartLavaInEmergencyMode (same as the original issue).

Added debugging:
- Full goroutine dump before killing emergency mode process
- Detailed logging around Getpgid() call
- Detailed logging around Kill() syscall
- 100ms sleep after each print for proper flushing

This will show:
- State of all goroutines when attempting kill
- Whether Getpgid() completes or hangs
- Whether Kill() completes or hangs
- If there's a deadlock preventing the kill operation
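
A sketch of the goroutine dump and the logged kill sequence, assuming a Unix-like system and a child started in its own process group. `dumpGoroutines` and `killProcessGroup` are hypothetical names, and `sleep 30` stands in for the emergency-mode process.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

// dumpGoroutines returns the stack traces of all goroutines.
func dumpGoroutines() string {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = include every goroutine
	return string(buf[:n])
}

// killProcessGroup looks up the process group of pid and sends SIGKILL to
// the whole group, logging before and after each syscall.
func killProcessGroup(pid int) error {
	fmt.Printf("kill: calling Getpgid(%d)\n", pid)
	pgid, err := syscall.Getpgid(pid)
	if err != nil {
		return fmt.Errorf("getpgid %d: %w", pid, err)
	}
	fmt.Printf("kill: Getpgid returned %d, sending SIGKILL to group\n", pgid)
	if err := syscall.Kill(-pgid, syscall.SIGKILL); err != nil {
		return fmt.Errorf("kill group %d: %w", pgid, err)
	}
	fmt.Println("kill: Kill returned")
	return nil
}

func main() {
	cmd := exec.Command("sleep", "30") // placeholder child process
	// Own process group, so the group kill does not hit this process.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}

	fmt.Print(dumpGoroutines())
	if err := killProcessGroup(cmd.Process.Pid); err != nil {
		fmt.Println(err)
	}
	_ = cmd.Wait() // reap the child; an error is expected because it was killed
}
```

The last line printed before a hang shows which of the two calls did not return.
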
2 participants