fix: fixes of e2e test - Draft #2139
Open
AnnaR-prog wants to merge 62 commits into main from fixes-e2e-test-2
Test Results
6 files (-1), 60 suites (-71), 31m 25s ⏱️ (-2m 32s)
For more details on these failures, see this check.
Results for commit 0fd96d8. ± Comparison against base commit 4ee4fa7.
This pull request removes 1331 and adds 27 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
6e09e55 to bc14545
…nfiguration from hot path
The heartbeat goroutine was accidentally removed. Adding it back to:
- Show test is still alive and progressing (prints every second)
- Detect if process freezes vs just waiting on channels

Also added debug prints for:
- Waiting for virtual epoch signals (1/3, 2/3, 3/3)
- REST relay test progress (1/10 through 10/10)

This will help identify if the hang after epoch 1 is due to:
- Epoch signal not being sent
- Channel blocking
- Process being killed externally
Added detailed logging to diagnose why epoch 2 signal might not be sent:
- Initial epoch counter and duration on startup
- Sleep duration and target time before each epoch wait
- Confirmation when epoch ends and signal is sent
- Whether signal was received or ignored (no receiver)
- Context cancellation detection
- Panic detection with explicit stdout print

This will show if:
- Sleep duration becomes negative or very large
- Signal is sent but not received
- Goroutine panics or is cancelled
- Goroutine stops running entirely
Added time.Sleep(100ms) before all critical debug prints in:
- Epoch goroutine (before every print statement)
- Main goroutine receive loop (before every print statement)

This ensures the output buffer has time to flush before the actual print statement, helping us see exactly where execution stops if there's an issue with print buffering or timing. Combined with os.Stdout.Sync() after prints, this gives us maximum visibility into what's happening.
The hang is occurring BEFORE the select statement, likely at the time.Sleep(100ms) call after 'about to send signal'. Removed that specific sleep to see if it's what's causing the hang. Kept the goroutine dump for epoch 2 to understand the state.

The sequence is:
✅ about to send signal for epoch 2 (printed)
❌ entering select... (never printed)

This suggests the issue is in one of these operations:
- epochCounter++ (unlikely)
- time.Sleep(100ms) (LIKELY - timer/scheduler issue)
- os.Stdout.Sync() (possible - output pipe blocked)
- fmt.Printf (possible - formatting issue)
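For reference, a goroutine dump like the one mentioned above can be produced with the standard library's `runtime.Stack`. The wrapper `dumpGoroutines` is a hypothetical name, and this sketch may differ from how the PR collects its dump.

```go
package main

import (
	"fmt"
	"runtime"
)

// dumpGoroutines (hypothetical) returns the stack traces of all goroutines.
func dumpGoroutines() string {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include every goroutine
		if n < len(buf) {
			return string(buf[:n])
		}
		// The buffer was filled, so the output may be truncated: retry bigger.
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	block := make(chan struct{})
	go func() { <-block }() // a goroutine parked on a channel receive
	runtime.Gosched()

	fmt.Printf("goroutines running: %d\n", runtime.NumGoroutine())
	fmt.Println(dumpGoroutines())
	close(block)
}
```

Each entry in the dump carries a state such as `[chan receive]` or `[sleep]`, which is what tells a blocked channel operation apart from a goroutine that is merely sleeping.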
CRITICAL BUG IDENTIFIED: The select statement itself freezes the ENTIRE Go runtime:
- Epoch goroutine prints 'entering select to send signal 2'
- ENTIRE PROCESS FREEZES (no heartbeat, no goroutines run)
- This happened BEFORE any goroutine dump was added
- Even independent heartbeat goroutine stops completely

This indicates the select statement configuration triggers a Go runtime bug or deadlock that blocks the scheduler itself.

THE FIX:
- Removed select statement entirely
- Use simple direct channel send: signalChannel <- true
- This will block until main receives (which is fine)
- Test will be killed via defer epochCancel() when done
- Avoids whatever pathological interaction was freezing runtime

This is the simplest possible approach - no select, no cases, just send.
Added comprehensive debug logging with 100ms sleeps after each print:
- Entry point
- Lock acquisition/release
- Number of commands copied
- Each command being killed
- Completion

The sleeps ensure proper flushing in CI environment and help identify exact point if cleanup phase has issues.
Direct send (without select) FIXED the epoch signal issue - test progressed! But now hangs at killing 10_StartLavaInEmergencyMode (same as original issue).

Added debugging:
- Full goroutine dump before killing emergency mode process
- Detailed logging around Getpgid() call
- Detailed logging around Kill() syscall
- 100ms sleep after each print for proper flushing

This will show:
- State of all goroutines when attempting kill
- Whether Getpgid() completes or hangs
- Whether Kill() completes or hangs
- If there's a deadlock preventing the kill operation
…by ensuring error processing order
… enforcing context cancellation
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
! in the type prefix if API or client breaking change
main branch
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...