[frontier] Add concentrated list of useful cray-mpich environment variables #1002
Conversation
Setting this environment variable to ``1`` spawns a thread dedicated to making progress on outstanding MPI communication and automatically raises the MPI thread level to ``MPI_THREAD_MULTIPLE``.
Applications that use one-sided MPI (e.g., ``MPI_Put``, ``MPI_Get``) or non-blocking collectives (e.g., ``MPI_Ialltoall``) will likely benefit from enabling this feature.
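A minimal sketch of how this might look in a job script. Note the variable name `MPICH_ASYNC_PROGRESS` and the launch line are assumptions for illustration (the snippet above does not show the variable's name); verify against the cray-mpich man page on your system.

```shell
# Assumed variable name: MPICH_ASYNC_PROGRESS (check `man mpi` on your system).
# Setting it to 1 spawns a progress thread and implies MPI_THREAD_MULTIPLE.
export MPICH_ASYNC_PROGRESS=1

# The progress thread competes for CPU time, so leaving a spare core per
# rank is a common precaution. Placeholder launch line for illustration:
srun -N 2 --ntasks-per-node=8 --cpus-per-task=2 ./my_app
```

Whether reserving a core actually pays off depends on how much the application overlaps communication with computation, which is exactly the open question discussed below.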
Interesting. My experiments with `MPI_Get` and `MPI_Ialltoall` seemed to work pretty well without the async thread. Maybe because I wasn't trying to overlap with heavy CPU-based computation?
So, @timattox and I had a discussion on this and both my recommendation (one-sided) and his (non-blocking collectives) are based on guidance we got from Krishna, but neither of us has had a chance to really test.
I'm not sure how much CPU computation has to do with it. I think this comes down to when progress happens, and Slingshot may change some of that. Without the offloaded rendezvous, progress would only happen inside a libfabric call, and that's only going to happen from an MPI call unless you have the progress thread.
The guidance in the MPICH man page is actually more broad than what we have here. It basically says "this is good for anything except blocking pt2pt".
My inclination is to leave this in for now, but make a point of specifically testing it over the next six months and updating with what we think the right guidance is for different codes.
The full list of cray-mpich environment variables can be quite intimidating for most users. This PR is an effort to pull out the ones most users should be aware of and write them in plain text.
I'm opening this as a PR because we need to iterate a bit on placement, formatting, and descriptions. There are also a few variables that didn't make this first cut that we might want to add: they were on the shortlist, but I decided to leave them out for now; perhaps they should be added back in. I feel like if we want to add those, we need a more dedicated MPI debugging page.