Add support for AMD GPUs via HIP #116

elliottslaughter · 2021-11-12T19:39:46Z

This is a work-in-progress branch to add support for AMD GPUs via HIP, a very nearly CUDA-compatible API by AMD. While this isn't ready to go yet, it's far enough along that I wanted to get feedback on the approach I'm taking.

The basic idea of HIP is to provide CUDA-compatible APIs that are identical except for naming. In most cases you can find-and-replace cuda with hip and things will work.

Rather than literally doing that transformation to the code (the way AMD seems to want you to do it), I'm using an approach with a header wrapper around cuda.h / hip/hip_runtime.h. Then I add some translations with #define cudaX hipX for each X API call used in the code. The advantage of going this route is that the vast majority of the code does not need to change at all, at least for a functionally correct version. I haven't looked at performance at all yet, and there may be more work to do there.
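To make the wrapper idea concrete, here is a minimal sketch of what such a header could look like. It is not the header from this branch: the USE_HIP macro, the roundtrip_alloc helper, and the host-only fallback (included only so the sketch compiles without a GPU toolkit) are assumptions for illustration, and only a handful of runtime calls are aliased.

```cpp
// Sketch of a CUDA/HIP compatibility header (macro and helper names are hypothetical).
#include <cstddef>
#include <cstdlib>

#if defined(USE_HIP)
  // Building with hipcc: pull in HIP and alias the CUDA names onto it.
  #include <hip/hip_runtime.h>
  #define cudaError_t hipError_t
  #define cudaSuccess hipSuccess
  #define cudaMalloc  hipMalloc
  #define cudaFree    hipFree
#elif defined(__CUDACC__)
  // Building with nvcc: use the real CUDA runtime, no aliases needed.
  #include <cuda_runtime.h>
#else
  // Host-only stand-ins so this sketch compiles without any GPU toolkit.
  typedef int cudaError_t;
  const cudaError_t cudaSuccess = 0;
  inline cudaError_t cudaMalloc(void** p, std::size_t n) {
    *p = std::malloc(n);
    return *p ? cudaSuccess : 1;
  }
  inline cudaError_t cudaFree(void* p) {
    std::free(p);
    return cudaSuccess;
  }
#endif

// An unchanged "CUDA-style" call site: the same source compiles on every branch above.
inline bool roundtrip_alloc(std::size_t nbytes) {
  void* d_buf = nullptr;
  if (cudaMalloc(&d_buf, nbytes) != cudaSuccess) return false;
  return cudaFree(d_buf) == cudaSuccess;
}
```

The point of the design is that call sites like roundtrip_alloc stay written against the CUDA names, so only the header needs to know which backend is in use.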

I also made a few minor changes to the build system, mostly to provide enough configuration flexibility so that site files are sufficient to build against HIP. I provide an example site in sites/make.inc.olcf_spock for Spock.

Status:

With this configuration, I'm able to build and run on Spock. I am testing against ROCm 4.5.0.

Running make site=olcf_spock check currently gives me failures in the following tests. All other tests are passing.

bin/cufinufft3d1_test 4 15 15 15 2048 1e-3
bin/cufinufft3d1_test 4 15 15 15
bin/cufinufft3d1_test_32 4 15 15 15 2048 1e-3
bin/cufinufft3d1_test_32 4 15 15 15

I had to remove pycuda from python/cufinufft/requirements.txt because it doesn't exist in HIP (PyCUDA has not been ported, and I'm not sure whether it will be). I'm not sure how to deal with this at the moment since I'm not sure requirements.txt has an optional dependency syntax.

I'd appreciate a review of the basic approach I'm using, and if anyone has advice on how to address the remaining test failures, that would be helpful. Thanks!

P.S. This may also be sufficient to get Intel GPUs working via HIPCL, though I haven't tested that.

elliottslaughter · 2021-12-02T21:56:31Z

Update: I upgraded to ROCm/HIP 4.5.0 (from 4.2.0) and it resolved the issue with symbols. The remaining tests that fail seem to be some sort of a numerical issue.

ahbarnett · 2022-02-03T22:11:23Z

This seems like a worthy goal. I do not have an AMD GPU to test on - and am frankly out of my depth - we will need others in the dev team to test!

Your macro-based approach looks good, at least as a way to get HIP started with minimal code changes. It unifies the headers for the two architectures. (Note that you now #include fft and complex headers always where they weren't previously - hope this isn't the cause of math errors).
Certainly better than the search-and-replace recommended in blog posts such as https://www.lumi-supercomputer.eu/preparing-codes-for-lumi-converting-cuda-applications-to-hip/.

To complete a PR you would want to add a doc section in README.md or a linked doc file about compiling on HIP. Basically, so everyone knows how to test what you've added.

Try to make the makefile have minimal changes, and a simple flag eg
make check -DHIP

Now, re math, if you can post a single test command line that fails (you say a math-output rather than crash failure?) maybe someone w/ AMD can help us out?

Thanks, Alex

elliottslaughter · 2022-02-05T00:47:38Z

I've been doing some digging and think I've found the root cause, which seems to be related to the host-side code and not to the GPU at all.

My debugging shows that we're not taking the following branch because MAX_NF is 0.

cufinufft/contrib/common.cpp

Line 31 in 46a8e8a

if (*nf<MAX_NF){ // otherwise will fail anyway

MAX_NF meanwhile is defined as (BIGINT)1e11 as you can see below (note this is master, not my branch):

cufinufft/contrib/common.h

Line 12 in 46a8e8a

    
#define MAX_NF    (BIGINT)1e11     // too big to ever succeed (next235 takes 1s)

And BIGINT is defined as int, which is 32 bits in most implementations.

cufinufft/contrib/dataTypes.h

Line 16 in 46a8e8a

typedef int BIGINT;

To be honest, I'm not sure how this has ever worked. Compare 1e11 with 1<<32 (one more than the largest value that fits in an unsigned 32-bit int, so expect about half that for signed):

>>> 1e11
100000000000.0
>>> 1<<32
4294967296

You can see that 1e11 clearly overflows. You need to get down to about 1e9 to find something that actually fits in 32 bits.
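The same range check can be expressed in C++ at compile time. This is only a sketch; fits_in_int32 is a hypothetical helper, not something in the cufinufft sources. It also hints at why the behavior can differ between platforms: in C++, converting a floating-point value to an integer type that cannot represent it is undefined behavior, so a cast like (int)1e11 has no guaranteed result.

```cpp
#include <cstdint>
#include <limits>

// True when v lies within the range of a 32-bit signed integer.
// Both bounds are exactly representable as double, so the comparison is exact.
constexpr bool fits_in_int32(double v) {
  return v >= static_cast<double>(std::numeric_limits<std::int32_t>::min()) &&
         v <= static_cast<double>(std::numeric_limits<std::int32_t>::max());
}

static_assert(!fits_in_int32(1e11), "1e11 does not fit in a 32-bit signed int");
static_assert(fits_in_int32(1e9), "1e9 fits in a 32-bit signed int");
```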

I have confirmed that changing 1e11 to 1e9 does make the test pass. So it never had anything to do with the GPU in the first place.

I think the potential solutions could be decreasing MAX_NF to be INT_MAX (or similar), or we could change BIGINT to be 64 bits if you really want to be able to represent 1e11 in that number. Either way, it's not really a HIP issue so it should probably be fixed in a separate PR.
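A rough sketch of the two options, side by side. The names BIGINT32, BIGINT64, MAX_NF_32, MAX_NF_64, nf_ok_32, and nf_ok_64 are made up so that both variants can coexist in one file; this is not necessarily what the eventual fix looks like.

```cpp
#include <climits>
#include <cstdint>

// Option 1: keep a 32-bit index type and cap the limit at INT_MAX.
typedef int BIGINT32;                        // hypothetical name
const BIGINT32 MAX_NF_32 = INT_MAX;

// Option 2: widen the index type so that 1e11 is representable.
typedef std::int64_t BIGINT64;               // hypothetical name
const BIGINT64 MAX_NF_64 = (BIGINT64)1e11;   // well defined: 1e11 < 2^63

// The guard from common.cpp, written against each choice.
inline bool nf_ok_32(BIGINT32 nf) { return nf < MAX_NF_32; }
inline bool nf_ok_64(BIGINT64 nf) { return nf < MAX_NF_64; }
```

With either definition the guard is taken for realistic grid sizes. Only option 2 keeps the original 1e11 threshold.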

MelodyShih · 2022-02-05T03:41:30Z

Hi @elliottslaughter , thanks for reporting this bug and the proposed solutions. This is my mistake. In FINUFFT (the cpu version), BIGINT is defined as int64_t and everything makes sense.

I will create a PR to fix it.

elliottslaughter · 2022-02-07T22:18:48Z

For what it's worth, this fix is sufficient to get the entire test suite working on Spock. I'm still seeing issues on Crusher (similar system with newer AMD hardware) that need investigating.

ahbarnett · 2022-02-09T21:36:12Z

Hi Elliot, that's exciting re getting HIP going. Looks like pulling in 1d failed... sorry... but it's a worthwhile part of the code to include. Good luck & thanks for your collaboration on this. -A

elliottslaughter · 2022-02-10T19:19:20Z

I'm currently working on diagnosing the issue that happens on Crusher with newer AMD hardware. Currently (as of my most recent push), the entire test suite works on Spock (with previous-generation AMD hardware).

After AMD works in this branch, I'll get back to NVIDIA and the other issues identified in this thread.

elliottslaughter · 2022-02-15T22:45:42Z

@ahbarnett I wanted to follow up on your comment #134 (review):

make check ... does not have official pass/fail detection; it's a matter of if the tests compile and run (no math check is actually done).

Do the Python tests have such checks? I ask because, so far I've been using make check for my AMD/HIP tests, and if they're not checked, it means I haven't actually validated the correctness. If Python does, that may be a path forward though I'll have to work around the lack of PyCUDA on AMD.
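For reference, a pass/fail check on the C++ side could look roughly like the following: compare the computed output against a reference and turn the relative l2 error into an exit status. This is a hypothetical sketch; rel_l2_error and check_result are not existing cufinufft functions, and where the reference values come from (for example a direct summation) is left open.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::vector<std::complex<double> > cvec;

// Relative l2 error ||got - ref||_2 / ||ref||_2.
// Assumes got and ref have the same length and ref is not all zeros.
inline double rel_l2_error(const cvec& got, const cvec& ref) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    num += std::norm(got[i] - ref[i]);  // std::norm returns |z|^2
    den += std::norm(ref[i]);
  }
  return std::sqrt(num / den);
}

// Returns 0 on pass and 1 on fail, suitable for use as a process exit status.
inline int check_result(const cvec& got, const cvec& ref, double tol) {
  if (got.size() != ref.size()) return 1;
  return rel_l2_error(got, ref) <= tol ? 0 : 1;
}
```

A test driver would return check_result(...) from main so that make check fails whenever the error exceeds the requested tolerance (such as the 1e-3 passed on the test command lines).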

janden · 2022-02-16T06:31:08Z

Do the Python tests have such checks?

They do. I would run python3 -m pytest python/cufinufft/tests to verify correctness.

elliottslaughter force-pushed the eds/hip branch from 4cd18c3 to 33bbbbd Compare January 29, 2022 00:41

elliottslaughter mentioned this pull request Feb 5, 2022

Run make check in Jenkins #125

Merged

MelodyShih mentioned this pull request Feb 6, 2022

modify MAX_NF and temporarily remove the usage of BIGINT in cufinufft source and test files #131

Merged

elliottslaughter added 7 commits February 9, 2022 10:56

Initial work on HIP support.

3569eb2

Fix linking in HIP.

f768ea8

Fix some tests.

eda340d

Fix CUDA include path and HIP linkage.

2fefe3d

More fixing paths.

d119a0c

Site file for Crusher.

9c4f615

Port 1d files and tests to HIP.

e1e9c8f

elliottslaughter force-pushed the eds/hip branch from 33bbbbd to e1e9c8f Compare February 9, 2022 16:05

hsudarrenj and others added 4 commits June 17, 2022 15:59

Commented out tests that fail on Crusher ROCm/4.5.0

e3b75f6

Change PyCUDA dependency to PyBindGPU

70edb3a

Make compatible with new PybindGPU

b8228de

Updates for ROCm 5.4.3.

7bfa316

blackwer mentioned this pull request Feb 20, 2024

Running on AMD #156

Open
