Skip to content

Conversation

biljanaorescanin
Copy link
Contributor

NOTE: We can't merge this until Nitrogen balance error is resolved.

Issue :
During CatchCN run we see error message but run doesn't exit it proceeds and finishes successfully.

In code we have calls to endrun in right locations but down the stream final code just did assert print /log not exit.

With Branch fix:
Code stops and prints error and trace to error, example from my 1day run:
In GEOSldas_log_txt trace:
ENDRUN:
ERROR:
ERROR in CNBalanceCheckMod.F90 at line 548

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 27588 RUNNING AT borgk145
= KILLED BY SIGNAL: 9 (Killed)

In GEOSldas_err_txt trace:
No errorforrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread-2.31.s 00001495FB25A910 Unknown Unknown Unknown
GEOSldas.x 0000000000445658 Unknown Unknown Unknown
libpthread-2.31.s 00001495FB25A910 Unknown Unknown Unknown
No errorshr_abort_abort: hard abort
shr_abort_abort: hard abort
shr_abort_abort: hard abort

@gmao-rreichle @weiyuan-jiang

@biljanaorescanin biljanaorescanin added the bug Something isn't working label Sep 4, 2025
Copy link

github-actions bot commented Sep 4, 2025

This PR is being prevented from merging because you have not added one of our required labels: 0 diff, 0 diff trivial, Non 0-diff, 0 diff structural, 0-diff trivial, Not 0-diff, 0-diff, automatic, 0-diff uncoupled, github_actions. Please add one so that the PR can be merged.

@biljanaorescanin biljanaorescanin changed the title fix endrun exit issue CatchCN CLM51 fix endrun exit issue Sep 4, 2025
@biljanaorescanin biljanaorescanin added the Contingent - DNA These changes are contingent on other PRs (DNA=do not approve) label Sep 4, 2025
Copy link

github-actions bot commented Sep 4, 2025

This PR is being prevented from merging because you have added one of our blocking labels: Contingent - DNA, Needs Lead Approval, Contingent -- Do Not Approve. You'll need to remove it before this PR can be merged.

@weiyuan-jiang
Copy link
Contributor

I am not sure if it is a right fix. The optional argument " ec" and "rc" are never used in the actual calls. Does the "stop" really stop the program ? Should we remove the ASSERT and bring back the original abort?

…CheckMod.F90, abortutils.F90, shr_abort_mod.F90)
@gmao-rreichle
Copy link
Contributor

I am not sure if it is a right fix. The optional argument " ec" and "rc" are never used in the actual calls. Does the "stop" really stop the program ? Should we remove the ASSERT and bring back the original abort?

@weiyuan-jiang : We talked with Tom, and I just pushed what I hope is a cleaner fix. If it builds, @biljanaorescanin can give it a try. In the meantime, don't hesitate to let me know if you see anything wrong. I always feel a bit lost when it comes to ESMF and MAPL

write(iulog,*)'grainc_to_cropprodc = ',grainc_to_cropprodc(c)*dt
write(iulog,*)'-1*som_c_leached = ',som_c_leached(c)*dt
call endrun(msg=errMsg(sourcefile, __LINE__))
call endrun(msg=errMsg(sourcefile, __LINE__),rc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it, why not just use _ASSERT(.false.) here (and below)?? In any case, we probably need to take a look at error handling throughout the code in ./CLM51

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial commit doesn't build because I didn't specify "Iam" in the CTSM abort routines. It doesn't make much sense to do that. I'll fix it in a bit along the lines of doing _ASSERT instead of endrun() in the science code

@weiyuan-jiang
Copy link
Contributor

I think the change is not enough. 1) When this abort is called, the program should stop running so it should not return SUCCESS. 2) We need to bubble up all the rc s in every call to get back the trace if we use MAPL's error handling. It is not hard, but there would be a lot of changes

@gmao-rreichle
Copy link
Contributor

@weiyuan-jiang, @biljanaorescanin : I reverted "abort" to that the original CTSM code (6e02675). This builds, and it might be worth seeing if it works.
However, I'm really confused about shr_sys_mod.F90:

  1. Jana's version of this file comments out almost everything. For example, subroutine shr_sys_system() (to make. a system call) has been commented out. But this subroutine is called in shr_file_mod.F90. I don't see why this builds. I must be missing something. (This issue has nothing to do with the "abort" problem addressed in the PR.)
  2. Per Jana's README_CN51.md, the original files are from CTSM tag branch_tags/PPE.n08_ctsm5.1.dev023. But the 'shr_sys_mod.F90' file from this tag is different from the file in ./CLM51_orig_files.

Ideally, one of the two "original" files could be used instead of the modified file. We should avoid changes from the original CTSM code if we can help it. (It doesn't help that it's not clear what the original code is...)

@biljanaorescanin
Copy link
Contributor Author

I ran a sanity test to confirm that we exit when an error occurs.
Once we’re ready, this will be good to go.

@mathomp4
Copy link
Member

Note: The Spack CI will (eventually) work once develop is merged into this branch (or it's parent branch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Contingent - DNA These changes are contingent on other PRs (DNA=do not approve)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants