Conversation

@abbotts (Contributor) commented Sep 24, 2025

The full list of cray-mpich environment variables can be quite intimidating for most users. This PR is an effort to pull out the ones most users should be aware of and describe them in plain language.

I'm opening this as a PR because we need to iterate a bit on placement, formatting, and descriptions. There are also a few that didn't make this first cut that we might want to add. In particular, the variables below were on the shortlist but I decided to leave them out; perhaps they should be added back in. I feel like if we want to add these, we need a more dedicated MPI debugging page.


If indicated by profiling or counters:
- `FI_MR_CACHE_MAX_COUNT` - NOT max size
- `MPICH_GPU_IPC_CACHE_MAX_SIZE`
- `FI_MR_CACHE_MONITOR`

If running complex workflows:
- `MPICH_SINGLE_HOST_ENABLED`
- `MPICH_OFI_NIC_POLICY`
    - `MPICH_OFI_NIC_VERBOSE`
    - `MPICH_OFI_NIC_MAPPING`

- `FI_CXI_RX_MATCH_MODE` - can we test how much memory this uses at startup in hybrid mode?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting this environment variable to ``1`` will spawn a thread dedicated to making progress on outstanding MPI communication and automatically increase the MPI thread level to ``MPI_THREAD_MULTIPLE``.
Applications that use one-sided MPI (e.g., ``MPI_Put``, ``MPI_Get``) or non-blocking collectives (e.g., ``MPI_Ialltoall``) will likely benefit from enabling this feature.
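
For illustration only, here is a minimal sketch of the kind of pattern that text is aimed at: a non-blocking collective overlapped with local computation, with the thread level requested explicitly as `MPI_THREAD_MULTIPLE`. The buffer sizes and the `compute_chunk()` stub are placeholders, not anything from the proposed docs.

```c
/* Minimal illustrative sketch (not from the proposed docs): a non-blocking
 * collective overlapped with local computation. Whether the exchange actually
 * advances during compute_chunk() depends on asynchronous progress; without
 * it, progress may only happen inside MPI calls. */
#include <mpi.h>
#include <stdlib.h>

static void compute_chunk(void)
{
    /* stand-in for the application's local computation */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += (double)i;
    (void)x;
}

int main(int argc, char **argv)
{
    int provided;
    /* The progress thread implies MPI_THREAD_MULTIPLE; requesting it here
     * keeps the program correct whether or not that feature is enabled. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1024;                 /* elements per rank, arbitrary */
    int *sendbuf = calloc((size_t)count * nranks, sizeof *sendbuf);
    int *recvbuf = calloc((size_t)count * nranks, sizeof *recvbuf);

    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_INT,
                  recvbuf, count, MPI_INT, MPI_COMM_WORLD, &req);

    compute_chunk();                        /* compute while the exchange is in flight */

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* complete the collective */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

With background progress enabled, the exchange can advance while `compute_chunk()` runs; without it, most of the communication tends to complete inside `MPI_Wait`.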
Contributor

Interesting. My experiments with MPI_Get and MPI_Ialltoall seemed to work pretty well without the async thread. Maybe because I wasn't trying to overlap with heavy CPU-based computation?

Contributor Author

So, @timattox and I had a discussion on this, and both my recommendation (one-sided) and his (non-blocking collectives) are based on guidance we got from Krishna, but neither of us has had a chance to really test it.

I'm not sure how much CPU computation has to do with it. I think this comes down to when progress happens, and Slingshot may change some of that. Without the offloaded rendezvous, progress only happens inside a libfabric call, and that only happens from an MPI call unless you have the progress thread.
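
To make that concrete, here's a rough sketch (illustrative only, not something from this PR) of the usual application-side workaround when there's no progress thread: poll `MPI_Test` from the compute loop so the library gets regular chances to advance the outstanding operation. The `MPI_Iallreduce` and the chunked loop are hypothetical details.

```c
/* Illustrative sketch only: driving progress from the application when no
 * dedicated progress thread is running. Each MPI_Test call enters the MPI
 * library (and libfabric underneath it), giving it a chance to advance the
 * outstanding operation. Chunk sizes and loop structure are hypothetical. */
#include <mpi.h>

static void compute_chunk(int step)
{
    /* stand-in for one slice of the application's computation */
    volatile double x = 0.0;
    for (int i = 0; i < 100000; ++i)
        x += (double)i * step;
    (void)x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send = (double)rank, recv = 0.0;
    MPI_Request req;
    MPI_Iallreduce(&send, &recv, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    int done = 0;
    for (int step = 0; step < 64 && !done; ++step) {
        compute_chunk(step);
        /* Without a progress thread, this periodic call is what lets the
         * reduction move forward between compute chunks. */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```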

The guidance in the MPICH man page is actually broader than what we have here. It basically says "this is good for anything except blocking pt2pt".

My inclination is to leave this in for now, but make a point to specifically test this over the next six months and update with what we think the right guidance is for different codes.
