-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8303612: runtime/StackGuardPages/TestStackGuardPagesNative.java fails with exit code 139 #25689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… with exit code 139 replace infinite loop with conditional in do_overflow function
Hi @mz1999, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user mz1999" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing |
❗ This change is not yet ready to be integrated. |
The copyright year should be updated from 2023 to 2025. |
- Change copyright year from 2023 to 2025 in the test file header
Done. Updated the copyright year to 2025 as requested. Thanks! |
/signed |
Thank you! Please allow for up to two weeks to process your OCA, although it is usually done within one to two business days. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated! |
@mz1999 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
/touch |
@mz1999 The pull request is being re-evaluated and the inactivity timeout has been reset. |
Hi @jdksjolen , Following up on this. As per the bot's message two weeks ago, I've been waiting for my OCA to be processed, but it seems to be stuck. My Oracle Account Email: [email protected] Any help would be greatly appreciated. Thank you! |
@mz1999 the forever-loop was added as part of the "hardening" changes by JDK-8295344, so it seems you are saying that introduced a new bug? How does your proposed change compare to the code that we had prior to the "hardening"? I just want to try and see what the original problem with this code was. Thanks |
Hi! I've never seen a CentOS 7 error nor an Ubuntu error, where does this come from? This issue was created before the forever-loop was introduced, where we had this: void do_overflow(){
int *p = alloca(sizeof(int));
if (_kp_rec_count == 0 || _rec_count < _kp_rec_count) {
_rec_count ++;
do_overflow();
} But that's the, morally, same code as you've added back. So, did the 'hardening' fix the original bug, but introduce a new bug? Isn't exit code 139 a SIGSEGV, so does this change a different type of bug? Thanks. |
Ping @mz1999 |
Hi @dholmes-ora and @jdksjolen , Apologies for the delayed response. Thank you for the comments. I wasn't aware of the historical context behind this test and the changes from JDK-8295344. My investigation started simply because I observed that the test behaved differently on two platforms. To help clarify what I'm seeing, let me walk you through my testing process with the original, unmodified I reverted my local changes and ran the test on both Ubuntu 24 and CentOS 7. Test Environments
Test on Ubuntu 24.04 LTSI ran the test using the following command to retain all test artifacts: make test TEST="test/hotspot/jtreg/runtime/StackGuardPages/TestStackGuardPagesNative.java" JTREG="RETAIN=all" The test passed successfully. The
To understand how it passed, I examined the native process log for the
The log shows that the second Of course, this is just my interpretation, and I may be mistaken. From my reading, the test is designed as a two-phase process:
A true "pass" for this verification should therefore correspond to the } else if (_last_si_code == -1) {
printf("Test PASSED. No stack guard page is present. Maximum recursion level reached at %d\n", _rec_count);
} However, the current test passes on Ubuntu not for this reason, but because the second overflow attempt is coincidentally terminated by a different signal, a 2. Test on CentOS 7 (Hangs and Times Out)Running the same test on CentOS 7 resulted in a timeout after 480 seconds. The
I checked the test artifacts. The log for the
However, the log for the failing Moving ForwardYour comments have been incredibly helpful. They've made me realize that this issue is far more complex than I first understood. My initial PR was focused squarely on fixing the hang symptom I observed on CentOS 7. I will need to do some further investigation to fully understand the original problem that the for(;;) loop was intended to solve. Thank you again for your guidance and patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this fix addresses your CentOS timeout issue by ensuring that the overflow loop will terminate even if no SEGV is encountered. IIUC "normally" the peek would eventually fault when we hit the libc guard page, but on CentOS this may not be the case and the region beyond the current stack may be accessible and so not fault.
However, I don't see how this fix addresses the other failure modes that have been reported in this issue.
Hi @dholmes-ora and @jdksjolen , Thank you so much for your insightful comments and for pointing me to the relevant JBS issues. After studying the history you pointed me to in JDK-8295344 and JDK-8293452, I think I have a better understanding of the evolution of this test and the critical reasoning behind the "hardening" fix. Here’s my understanding of the test's journey, which I hope clarifies my intention with this PR. The Original Recursive ImplementationThe original implementation was recursive, which looked something like this: // Generation 1: Recursive and Bounded
void do_overflow(){
int *p = alloca(sizeof(int));
if (_kp_rec_count == 0 || _rec_count < _kp_rec_count) {
_rec_count ++;
do_overflow();
}
} My reading of The "Hardening" Fix (Iterative but Unbounded)The "hardening" fix in // Generation 2: Iterative but Unbounded
void do_overflow(){
volatile int *p = NULL;
if (_kp_rec_count == 0 || _rec_count < _kp_rec_count) {
for(;;) {
_rec_count++;
p = (int*)alloca(128);
_peek_value = p[0]; // Peek
}
}
} While this successfully eliminated the stack corruption from My Proposed Change (Iterative and Bounded)My proposed change is an evolution of this iterative approach, not a regression to recursion. It retains the iterative nature of the hardening fix but adds a boundary condition to the loop: // Generation 3 (My PR): Iterative and Bounded
void do_overflow(){
volatile int *p = NULL;
while (_kp_rec_count == 0 || _rec_count < _kp_rec_count) {
_rec_count++;
p = (int*)alloca(128);
_peek_value = p[0]; // Peek
}
} It retains the crucial iterative approach from I believe this approach resolves both the original crash and the subsequent hang. Thank you again for your guidance and patience. |
I agree this continues to fix the problem seen with the recursive approach. I also agree it should fix the CentOS hang. However, I do not see how this will address the actual failure that is reported in JBS for this issue: Testing NATIVE_OVERFLOW The JBS connections and timeline is very hard to follow, but I believe this failure was seen after the hardening fix had been applied. |
Thank you for your patience and for pointing out the discrepancy. You were absolutely right to question how my PR addressed the failure mode in the JBS issue. After re-evaluating everything, I realize I have made a mistake and confused two separate issues. I sincerely apologize for the noise and for taking up your valuable time. Let me clarify what happened. The JBS issue associated with this PR, However, the problem I have been trying to solve is a timeout hang that occurs specifically in the My investigation started when I encountered this hang while running the To confirm this, I re-ran the tests from the jdk17u repository on my CentOS 7 environment today. The relevant
The logs confirm that the This is fundamentally different from the teardown crash described in Again, my apologies for the confusion I've caused. I will take some more time to reconsider the best path forward. If I cannot find a way to contribute that cleanly addresses a well-defined issue, it would be better to maintain the status quo. Thank you again for your time and guidance. |
I would suggest filing a new bug for the CentOS hang and create a new PR of the change here, with that new bug ID. |
Thank you, that plan sounds great. I don't yet have an account on bugs.openjdk.org to file the new bug. Would you mind creating the JBS issue for this hang? I've drafted all the necessary information below to help. Suggested JBS Issue ContentTitle: Description: The This issue is a regression introduced by the "hardening" fix in However, this unbounded loop relies on an incidental The proposed solution is to make the iterative loop bounded by checking against the previously recorded overflow depth ( Once the new bug ID is available, I will close this current PR and open a new one with the fix, linked to the new ID, as you suggested. |
I have created a new issue https://bugs.openjdk.org/browse/JDK-8366787 |
Hi @dholmes-ora and @sendaoYan, Thank you so much, @sendaoYan, for creating the JBS issue for me! I really appreciate your help in moving this forward. As suggested by @dholmes-ora, I am now closing this PR because it was linked to the incorrect JBS issue. I have followed the advice and taken the following steps:
The new PR is available here: #27114 Thank you again for all your support in getting this on the right track. All future discussion should please continue on the new PR. |
This pull request addresses an issue in
runtime/StackGuardPages/TestStackGuardPagesNative
where the native test component (exeinvoke.c
) exhibited platform-dependent behavior and did not fully align with the intended test objectives for verifying stack guard page removal on thread detachment.Summary of the Problem:
The
test_native_overflow_initial
scenario withinTestStackGuardPagesNative
showed inconsistent results:SEGV_MAPERR
to terminate a loop that should have had a defined exit condition.The core issue was that the native code's second stack overflow attempt, designed to check for guard page removal, used an unbounded loop. Its termination (and thus the test's outcome) depended on platform-specific OS behavior regarding extensive stack allocation after guard pages are supposedly modified.
Test Objective Analysis:
The primary goal of
TestStackGuardPagesNative
, particularly for the initial thread (test_native_overflow_initial
), is to:SIGSEGV
withsi_code = SEGV_ACCERR
, indicating an active stack guard page.DetachCurrentThread()
, confirm that the previously active stack guard page is no longer enforcing the same strict protection. This is ideally demonstrated by successfully allocating stack space up to the depth that previously caused theSEGV_ACCERR
, without encountering any signal.How the Original Implementation Deviated from the Test Intent:
The native
do_overflow
function, when invoked for the second phase (to check guard page removal), implemented an unconditionalfor(;;)
loop.SEGV_ACCERR
). However, the unbounded loop meant:SEGV_MAPERR
was possible, this loop ran for an excessive duration, leading to a hang.SEGV_MAPERR
. The existing result-checking logic inrun_native_overflow
incorrectly treated thisSEGV_MAPERR
as a "pass" condition for guard page removal, which is misleading. The true indication of successful guard page removal is the absence of any signal during the controlled second allocation phase.This reliance on incidental, platform-dependent signal behavior to terminate the loop (and the misinterpretation of
SEGV_MAPERR
as a success) meant the test was not robustly verifying its core objective.Proposed Solution:
This PR modifies the
do_overflow
function inexeinvoke.c
to accurately reflect the test intent for both phases of stack overflow checking, ensuring deterministic behavior:The
do_overflow
function is updated to use a singlewhile
loop whose condition correctly handles both the initial overflow check and the subsequent verification of guard page removal:Impact and Clarity of Test Outcome with This Change:
do_overflow
. The loop now has a defined boundary based on_kp_rec_count
. This directly resolves the hang observed on platforms like CentOS 7.do_overflow
: Thedo_overflow
function will now consistently:SIGSEGV
occurs andlongjmp
is called._kp_rec_count
times. If noSIGSEGV
occurs, it will complete this bounded loop and return normally. If aSIGSEGV
does occur within these_kp_rec_count
allocations,longjmp
will be called.This refinement ensures
TestStackGuardPagesNative
more accurately and reliably verifies the stack guard page lifecycle for native threads attached to the JVM.Progress
Issue
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25689/head:pull/25689
$ git checkout pull/25689
Update a local copy of the PR:
$ git checkout pull/25689
$ git pull https://git.openjdk.org/jdk.git pull/25689/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25689
View PR using the GUI difftool:
$ git pr show -t 25689
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25689.diff
Using Webrev
Link to Webrev Comment