Add support for AMD GPUs via HIP #116
Update: I upgraded to ROCm/HIP 4.5.0 (from 4.2.0) and it resolved the issue with symbols. The remaining test failures seem to be some sort of numerical issue.
This seems like a worthy goal. I do not have an AMD GPU to test on, and am frankly out of my depth, so we will need others in the dev team to test! Your macro-based approach looks good, at least as a way to get HIP started with minimal code changes. It unifies the headers for the two architectures. (Note that you now always #include the fft and complex headers where they weren't included previously; I hope this isn't the cause of the math errors.) To complete a PR you would want to add a doc section in README.md, or a linked doc file, about compiling on HIP, so that everyone knows how to test what you've added. Try to keep the makefile changes minimal, with a simple flag eg Now, re math: if you can post a single test command line that fails (you say it's a math-output failure rather than a crash?), maybe someone with an AMD GPU can help us out? Thanks, Alex
I've been doing some digging and think I've found the root cause, which seems to be related to the host-side code and not to the GPU at all. My debugging shows that we're not taking the following branch because Line 31 in 46a8e8a
Line 12 in 46a8e8a
And Line 16 in 46a8e8a
To be honest, I'm not sure how this has ever worked. If you compare
You can see that I have confirmed that changing I think the potential solutions could be decreasing
Hi @elliottslaughter, thanks for reporting this bug and the proposed solutions. This is my mistake. In FINUFFT (the CPU version), I will create a PR to fix it.
Hi Elliott, that's exciting re getting HIP going. Looks like pulling in 1d failed... sorry... but it's a worthwhile part of the code to include. Good luck & thanks for your collaboration on this. -A
I'm currently working on diagnosing the issue that happens on Crusher with newer AMD hardware. As of my most recent push, the entire test suite works on Spock (with previous-generation AMD hardware). After AMD works in this branch, I'll get back to NVIDIA and the other issues identified in this thread.
@ahbarnett I wanted to follow up to your comment #134 (review):
Do the Python tests have such checks? I ask because, so far I've been using
They do. I would run
This is a work-in-progress branch to add support for AMD GPUs via HIP, a very nearly CUDA-compatible API by AMD. While this isn't ready to go yet, it's far enough along that I wanted to get feedback on the approach I'm taking.
The basic idea of HIP is to provide CUDA-compatible APIs that are identical except for naming. In most cases you can find-and-replace `cuda` with `hip` and things will work.
Rather than literally doing that transformation to the code (the way AMD seems to want you to do it), I'm using an approach with a header wrapper around `cuda.h`/`hip/hip_runtime.h`. Then I add some translations with `#define cudaX hipX` for each `X` API call used in the code. The advantage of going this route is that the vast majority of the code does not need to change at all, at least for a functionally correct version. I haven't looked at performance at all yet, and there may be more work to do there.
I also made a few minor changes to the build system, mostly to provide enough configuration flexibility so that site files are sufficient to build against HIP. I provide an example site in `sites/make.inc.olcf_spock` for Spock.
Status:
With this configuration, I'm able to build and run on Spock. I am testing against ROCm 4.5.0.
Running `make site=olcf_spock check` currently gives me failures in the following tests. All other tests are passing.
I had to remove `pycuda` from `python/cufinufft/requirements.txt` because it doesn't exist in HIP (PyCUDA has not been ported, and I'm not sure whether it will be). I'm not sure how to deal with this at the moment since I'm not sure `requirements.txt` has an optional dependency syntax.
I'd appreciate a review of the basic approach I'm using, and if anyone has advice on how to address the remaining test failures, that would be helpful. Thanks!
P.S. This may also be sufficient to get Intel GPUs working via HIPCL, though I haven't tested that.
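For readers who have not seen the header-wrapper pattern described above, here is a minimal sketch of the idea. It is not the header from this branch: the macro `SKETCH_USE_REAL_HIP`, the function `alloc_and_free`, and the host-only stand-ins are all hypothetical, added only so the example compiles without a GPU toolkit. The four names being mapped (`hipError_t`, `hipSuccess`, `hipMalloc`, `hipFree`) are real HIP runtime symbols.

```cpp
#include <cstddef>
#include <cstdlib>

#if defined(SKETCH_USE_REAL_HIP)
// With ROCm installed, the real runtime header supplies the hip* names.
#include <hip/hip_runtime.h>
#else
// Assumption: host-only stand-ins that mimic the HIP names, so this
// sketch builds on a machine with no GPU toolkit. Not part of HIP.
typedef int hipError_t;
static const hipError_t hipSuccess = 0;
inline hipError_t hipMalloc(void **ptr, std::size_t nbytes) {
  *ptr = std::malloc(nbytes);
  return *ptr ? hipSuccess : 1;
}
inline hipError_t hipFree(void *ptr) {
  std::free(ptr);
  return hipSuccess;
}
#endif

// The translation layer: one "#define cudaX hipX" per API name in use.
#define cudaError_t hipError_t
#define cudaSuccess hipSuccess
#define cudaMalloc hipMalloc
#define cudaFree hipFree

// Existing CUDA-style code stays untouched; the preprocessor rewrites
// each cuda* name to its hip* counterpart before compilation.
// Returns 0 on success and 1 on any allocation or free failure.
int alloc_and_free(std::size_t nbytes) {
  void *d_buf = nullptr;
  cudaError_t err = cudaMalloc(&d_buf, nbytes);
  if (err != cudaSuccess) return 1;
  return cudaFree(d_buf) == cudaSuccess ? 0 : 1;
}
```

Calling `alloc_and_free(64)` exercises the mapped names end to end. In a real build the stand-in branch would be replaced by a CUDA branch that includes the CUDA headers and defines no translations, so the same source compiles for either vendor.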