-
Notifications
You must be signed in to change notification settings - Fork 9
CatchCN CLM51 fix endrun exit issue #1152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/jkolassa_cnclm51
Are you sure you want to change the base?
CatchCN CLM51 fix endrun exit issue #1152
Conversation
This PR is being prevented from merging because you have not added one of our required labels: 0 diff, 0 diff trivial, Non 0-diff, 0 diff structural, 0-diff trivial, Not 0-diff, 0-diff, automatic, 0-diff uncoupled, github_actions. Please add one so that the PR can be merged. |
This PR is being prevented from merging because you have added one of our blocking labels: Contingent - DNA, Needs Lead Approval, Contingent -- Do Not Approve. You'll need to remove it before this PR can be merged. |
I am not sure if it is a right fix. The optional argument " ec" and "rc" are never used in the actual calls. Does the "stop" really stop the program ? Should we remove the ASSERT and bring back the original abort? |
…CheckMod.F90, abortutils.F90, shr_abort_mod.F90)
@weiyuan-jiang : We talked with Tom, and I just pushed what I hope is a cleaner fix. If it builds, @biljanaorescanin can give it a try. In the meantime, don't hesitate to let me know if you see anything wrong. I always feel a bit lost when it comes to ESMF and MAPL |
write(iulog,*)'grainc_to_cropprodc = ',grainc_to_cropprodc(c)*dt | ||
write(iulog,*)'-1*som_c_leached = ',som_c_leached(c)*dt | ||
call endrun(msg=errMsg(sourcefile, __LINE__)) | ||
call endrun(msg=errMsg(sourcefile, __LINE__),rc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come to think of it, why not just use _ASSERT(.false.) here (and below)?? In any case, we probably need to take a look at error handling throughout the code in ./CLM51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial commit doesn't build because I didn't specify "Iam" in the CTSM abort routines. It doesn't make much sense to do that. I'll fix it in a bit along the lines of doing _ASSERT instead of endrun() in the science code
I think the change is not enough. 1) When this abort is called, the program should stop running so it should not return SUCCESS. 2) We need to bubble up all the rc s in every call to get back the trace if we use MAPL's error handling. It is not hard, but there would be a lot of changes |
@weiyuan-jiang, @biljanaorescanin : I reverted "abort" to that the original CTSM code (6e02675). This builds, and it might be worth seeing if it works.
Ideally, one of the two "original" files could be used instead of the modified file. We should avoid changes from the original CTSM code if we can help it. (It doesn't help that it's not clear what the original code is...) |
I ran a sanity test to confirm that we exit when an error occurs. |
Note: The Spack CI will (eventually) work once |
NOTE: We can't merge this until Nitrogen balance error is resolved.
Issue :
During CatchCN run we see error message but run doesn't exit it proceeds and finishes successfully.
In code we have calls to endrun in right locations but down the stream final code just did assert print /log not exit.
With Branch fix:
Code stops and prints error and trace to error, example from my 1day run:
In GEOSldas_log_txt trace:
ENDRUN:
ERROR:
ERROR in CNBalanceCheckMod.F90 at line 548
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 27588 RUNNING AT borgk145
= KILLED BY SIGNAL: 9 (Killed)
In GEOSldas_err_txt trace:
No errorforrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.31.s 00001495FB25A910 Unknown Unknown Unknown
GEOSldas.x 0000000000445658 Unknown Unknown Unknown
libpthread-2.31.s 00001495FB25A910 Unknown Unknown Unknown
No errorshr_abort_abort: hard abort
shr_abort_abort: hard abort
shr_abort_abort: hard abort
@gmao-rreichle @weiyuan-jiang