Skip to content

Conversation

Matthew-Whitlock
Copy link
Contributor

It is possible for the subcomms created by coll/han to have a lower CID than the parent comm. If the parent is not freed by the user by the time of MPI_Finalize, the subcomms will then be freed first. The parent is freed soon after, and coll/han cleans up by now double-freeing the subcomms.

This change prevents the subcomms from being freed before the parent comm is freed.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! You need to do the same trick to the communicators created via mca_coll_han_comm_create (not the _new) around line 392 in the same file. There should be 4 communicators up_comms[0], up_comms[1], low_comms[0], and low_comms[1]`.

@janjust
Copy link
Contributor

janjust commented Sep 9, 2025

@Matthew-Whitlock Can you take a look at @bosilca comment - think you can update the PR with a full fix?

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just change the HAN_SUBCOM_EXTRA_RETAIN macro from testing if the comm is already extra retained, which is should never be because this comm is not an intercomm, to testing if the comm id is lower than the parent comm (OMPI_COMM_CID_IS_LOWER((COMM), (PARENT_COMM))) and only extra retain it if the condition is true. This is the same code as in the other case where we mark a communicator as extra retained.

@Matthew-Whitlock
Copy link
Contributor Author

Got it. I also caught that there was one extra low_comm to mark.

@Matthew-Whitlock
Copy link
Contributor Author

Any other steps for getting this merged in?

@bosilca bosilca merged commit 868fff3 into open-mpi:main Sep 26, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants