Conversation

@abbotts (Contributor) commented Sep 24, 2025

The full list of cray-mpich environment variables can be quite intimidating for most users. This PR is an effort to pull out the ones most users should be aware of and describe them in plain language.

I'm opening this as a PR because we need to iterate a bit on placement, formatting, and descriptions. There are also a few that didn't make this first cut that we might want to add. In particular, the variables below were on the shortlist but I decided to leave them out; perhaps they should be added back in. I feel like if we want to add these, we need a more dedicated MPI debugging page.


If indicated by profiling or counters:
- `FI_MR_CACHE_MAX_COUNT` - NOT max size
- `MPICH_GPU_IPC_CACHE_MAX_SIZE`
- `FI_MR_CACHE_MONITOR`

If running complex workflows:
- `MPICH_SINGLE_HOST_ENABLED`
- `MPICH_OFI_NIC_POLICY`
    - `MPICH_OFI_NIC_VERBOSE`
    - `MPICH_OFI_NIC_MAPPING`

- `FI_CXI_RX_MATCH_MODE` - can we test how much memory this uses at startup in hybrid mode?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting this environment variable to ``1`` will spawn a thread dedicated to making progress on outstanding MPI communication and automatically increase the MPI thread level to ``MPI_THREAD_MULTIPLE``.
Applications that use one-sided MPI (e.g., ``MPI_Put``, ``MPI_Get``) or non-blocking collectives (e.g., ``MPI_Ialltoall``) will likely benefit from enabling this feature.
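
For illustration only, here is a minimal sketch of the kind of pattern that text is aimed at: a non-blocking collective overlapped with local computation, with the thread level requested explicitly as `MPI_THREAD_MULTIPLE`. The buffer sizes and the `compute_chunk()` stub are placeholders, not anything from the proposed docs.

```c
/* Minimal illustrative sketch (not from the proposed docs): a non-blocking
 * collective overlapped with local computation. Whether the exchange actually
 * advances during compute_chunk() depends on asynchronous progress; without
 * it, progress may only happen inside MPI calls. */
#include <mpi.h>
#include <stdlib.h>

static void compute_chunk(void)
{
    /* stand-in for the application's local computation */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += (double)i;
    (void)x;
}

int main(int argc, char **argv)
{
    int provided;
    /* The progress thread implies MPI_THREAD_MULTIPLE; requesting it here
     * keeps the program correct whether or not that feature is enabled. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1024;                 /* elements per rank, arbitrary */
    int *sendbuf = calloc((size_t)count * nranks, sizeof *sendbuf);
    int *recvbuf = calloc((size_t)count * nranks, sizeof *recvbuf);

    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_INT,
                  recvbuf, count, MPI_INT, MPI_COMM_WORLD, &req);

    compute_chunk();                        /* compute while the exchange is in flight */

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* complete the collective */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

With background progress enabled, the exchange can advance while `compute_chunk()` runs; without it, most of the communication tends to complete inside `MPI_Wait`.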
Contributor

Interesting. My experiments with MPI_Get and MPI_Ialltoall seemed to work pretty well without the async thread. Maybe because I wasn't trying to overlap with heavy CPU-based computation?

Contributor Author

So, @timattox and I had a discussion on this, and both my recommendation (one-sided) and his (non-blocking collectives) are based on guidance we got from Krishna, but neither of us has had a chance to really test it.

I'm not sure how much CPU computation has to do with it. I think this comes down to when progress happens, and Slingshot may change some of that. Without the offloaded rendezvous, progress only happens inside a libfabric call, and that only happens from an MPI call unless you have the progress thread.
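
To make that concrete, here's a rough sketch (illustrative only, not something from this PR) of the usual application-side workaround when there's no progress thread: poll `MPI_Test` from the compute loop so the library gets regular chances to advance the outstanding operation. The `MPI_Iallreduce` and the chunked loop are hypothetical details.

```c
/* Illustrative sketch only: driving progress from the application when no
 * dedicated progress thread is running. Each MPI_Test call enters the MPI
 * library (and libfabric underneath it), giving it a chance to advance the
 * outstanding operation. Chunk sizes and loop structure are hypothetical. */
#include <mpi.h>

static void compute_chunk(int step)
{
    /* stand-in for one slice of the application's computation */
    volatile double x = 0.0;
    for (int i = 0; i < 100000; ++i)
        x += (double)i * step;
    (void)x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send = (double)rank, recv = 0.0;
    MPI_Request req;
    MPI_Iallreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    int done = 0;
    for (int step = 0; step < 64 && !done; ++step) {
        compute_chunk(step);
        /* Without a progress thread, this periodic call is what lets the
         * reduction move forward between compute chunks. */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```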

The guidance in the MPICH man page is actually broader than what we have here. It basically says "this is good for anything except blocking pt2pt".

My inclination is to leave this in for now, but make a point to specifically test this over the next six months and update with what we think the right guidance is for different codes.
