
Conversation

@JBlaschke
Contributor

JBlaschke commented Oct 25, 2020

Following the discussion in #67, I implemented some basic multi-GPU support. The strategy with cufinufft has been to:

  1. Add a cuda_device_id to the cufinufft_opts struct
  2. All top-level cufinufft functions that take the options (or a copy of the options in the plan) respect this setting:
    1. CUFINUFFT_MAKEPLAN
    2. CUFINUFFT_SETPTS
    3. CUFINUFFT_EXECUTE
    4. CUFINUFFT_DESTROY
    5. everything in memtransfer_wrapper.cu

In theory, we could enable more fine-grained multi-GPU control, but this is more than enough for data-parallel workflows (the cudaSetDevice calls in memtransfer_wrapper.cu are probably overkill -- but I tend not to take chances with memory). Here is an example of a data-parallel workflow in python:

#_______________________________________________________________________________
# Initialize CUDA Driver
#

import atexit
import pycuda
import pycuda.driver    as drv
from   pycuda.gpuarray import GPUArray, to_gpu
from   mpi4py          import MPI
from   cufinufft      import cufinufft

drv.init()

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
atexit.register(MPI.Finalize)

# Bind each MPI rank to "its" GPU (assumes one rank per device)
dev = drv.Device(rank)
ctx = dev.make_context()

atexit.register(ctx.pop)

DEV_ID = rank

#---------------------------------------------------------------------------


# H_, K_, L_, ugrid, and the parameters (N, shape, tol, dtype, complex_dtype,
# opts) are defined elsewhere in the application; copy the inputs to the device
H_gpu = to_gpu(H_)
K_gpu = to_gpu(K_)
L_gpu = to_gpu(L_)
ugrid_gpu = to_gpu(ugrid.astype(complex_dtype))

# Allocate space on Device for the non-uniform output values
nuvect_gpu = GPUArray(shape=(N,), dtype=complex_dtype)

#___________________________________________________________________________
# Solve the NUFFT (type 2) on the GPU selected above
#

plan = cufinufft(2, shape, -1, tol, dtype=dtype, opts=opts, gpu_method=1, gpu_device_id=DEV_ID)
plan.set_pts(H_gpu, K_gpu, L_gpu)
plan.execute(nuvect_gpu, ugrid_gpu)

#
#---------------------------------------------------------------------------

# Copy the result back to the host
nuvect = nuvect_gpu.get()

I can package some basic test scripts based on this workflow if all y'all are interested, but I wanted to solicit some feedback first.

This works nicely on NERSC's Cori-GPU. Gonna try OLCF Summit next, and maybe our DGX cluster.

@JBlaschke
Contributor Author

Please let me know if you need anything more complex. We're controlling our workflow from python => I wanted to keep this PR as minimal as possible to not cause any problems upstream.

@janden
Collaborator

janden commented Nov 6, 2020

Thank you for putting this together. I'm sorry we haven't gotten around to reviewing this. The past two weeks have been very busy on my end, but I'll try to get through this sometime next week.

@JBlaschke
Contributor Author

@janden No worries! I have the same problem (big project review overlapping with SC20). Let me know if you have any questions.

@janden
Collaborator

janden left a comment


This looks pretty straightforward. My main worry is about state. If we go around calling cudaSetDevice, is that going to mess up other CUDA libraries? Or is it up to each library to make sure it's operating on the right device?

@JBlaschke
Contributor Author

Feel free to make those changes (you should have write access to my repo) -- or let me know and I'll implement them myself, @janden

@janden
Collaborator

janden commented Jan 22, 2021

Ok I'll make the changes. What about cudaSetDevice? Is it ok for us to go around setting that?

@JBlaschke
Contributor Author

Ok I'll make the changes. What about cudaSetDevice? Is it ok for us to go around setting that?

You mean like we do here 6071e25#diff-381d9fc71fd414e7c4ba1a28764788aa8aa95c5f971797d9219da10e213b9400R104 ?

Short answer: yes

Longer answer: I might be using it slightly too much (it's fine to call once per process) so we can call it during the plan creation only. But I am a little paranoid about these things, so I tend to want to call it every time I enter a library call that needs it.

@JBlaschke
Contributor Author

... the situation that I am paranoid about is if 2 libraries set it to different devices. The way it is now, all functions called by the python API would address the correct device.

Something to think about is to query the device ID and reset it at the end of the top-level function calls. This way we won't be interfering with libraries that are not as paranoid.

@janden
Collaborator

janden commented Jan 22, 2021

Right, that last part is my concern. Is the default convention that all libraries should be paranoid, or should they be courteous and reset the device ID after they're done?

@JBlaschke
Contributor Author

JBlaschke commented Jan 22, 2021

I was initially worried about performance. Now that I've measured it, and I see that cudaSetDevice isn't measurably slowing anything down, I think we should be paranoid.

I.e. libraries should not change the state of the device for any other libraries.

I suggest we wrap all the functions that set the device as follows:

int orig_dev;
cudaGetDevice(&orig_dev);    // remember the caller's device

// ... do the work on the requested device ...

cudaSetDevice(orig_dev);     // restore it before returning

@janden
Collaborator

janden commented Jan 22, 2021

Sounds wonderful. Please try the new interface and see how it fits with your code. If you can, it would also be a good idea to run pytest on a multi-GPU machine. It should automatically detect it and run the test_multi unit test for those.

@JBlaschke
Contributor Author

Will do -- just waiting for my Cori GPU allocation for testing

@JBlaschke
Contributor Author

Just to confirm @janden -- you haven't made the changes to the options api elsewhere?

@janden
Collaborator

janden commented Jan 22, 2021

Just to confirm @janden -- you haven't made the changes to the options api elsewhere?

I rebased onto the new master (which includes the changes from #78). As a result, you no longer specify the opts structure as an argument but supply kwargs for those options. So plan creation becomes
plan = cufinufft(1, shape, eps=tol, dtype=dtype, gpu_device_id=device_id)
You can look at test_multi.py for a complete example.

@JBlaschke
Contributor Author

Thanks!

@JBlaschke
Contributor Author

Hey @janden -- just wanted to quickly check with you: even though I'm doing this:

plan = cufinufft(2, shape, -1, tol, dtype=dtype, gpu_method=1, gpu_device_id=DEV_ID)

printf shows that opts->gpu_device_id stays at 0 in the cuda code -- am I missing something here?

@JBlaschke
Contributor Author

Ok, found the problem! It was me (quelle surprise!). I had cuda_device_id instead of gpu_device_id. Note that this did NOT trip an AttributeError -- probably because, as with assigning a non-existent key to a dict, the assignment is not invalid code but simply creates the entry (fun fact: this has been rated as one of the biggest things that trips up new devs at Facebook).

@ahbarnett
Collaborator

ahbarnett commented Jan 23, 2021 via email

@JBlaschke
Contributor Author

hrm... my latest commit wasn't right either ... lemme try something else

@JBlaschke
Contributor Author

Ok -- I always feel proud if I can use map in python

@JBlaschke
Contributor Author

@janden this version works on Cori GPU -- I tested with 16 ranks on 8 GPUs

@ahbarnett
Collaborator

ahbarnett commented Jan 23, 2021 via email

@JBlaschke
Contributor Author

Oh @ahbarnett that's mainly because a common strategy for latency hiding is for one thread to launch kernels while another does memcpy -- in python, true multi-threading is a bit of a can of worms, so a common strategy is to have mpi4py provide that parallelism instead. If the ratio of communication to compute were higher, then we might even get some more out of the device if we had >1 ranks doing work on it.

@janden
Collaborator

janden commented Jan 23, 2021

Now I'm confused. I thought self.opts was an object with attributes, not a dictionary. Even so, shouldn't hasattr work? I don't see why you need this map stuff…

@janden
Collaborator

janden commented Jan 23, 2021

Oh I see now. It's a ctypes thing. The variable _fields_ lists the fields and types that the struct represents (and which get mapped to the C struct when passed through ctypes). So we're checking against that list of fields. Did I understand that correctly? I'm going to refactor this a bit to make that more clear.

@JBlaschke
Contributor Author

JBlaschke commented Jan 23, 2021

In python, setattr behaves like a dict assignment operator -- or like a std::map assignment operator in C++. E.g.:

class Basic(object):
    pass

b = Basic()
setattr(b, "z", 1)
print(b.__dict__)

this will always output {'z': 1} -- even though we never define the z attribute => it comes into existence once we try to set it.

@JBlaschke
Contributor Author

The ctypes.Structure doesn't set attributes anyway -- hence we need to check the field names. The map is the most elegant way to do this in my opinion -- it doesn't materialize all the field entries up front, but yields the field names one at a time as the membership check runs.

Your way here: 918d45d#diff-55a88bcb0ebecf4899b36b07fdea411828b2ac581559f49dd7d494b8896940f6R108 does essentially the same thing at the cost of extra memory.
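
For readers following along, here is a minimal sketch of the pattern under discussion (the struct and helper names are hypothetical, not the actual cufinufft code): kwargs are validated against the ctypes _fields_ list before being assigned, so a typo like cuda_device_id raises instead of silently creating a new attribute.

import ctypes

class DemoOpts(ctypes.Structure):
    # Hypothetical stand-in for cufinufft_opts
    _fields_ = [("gpu_device_id", ctypes.c_int),
                ("gpu_method",    ctypes.c_int)]

def set_opts(opts, **kwargs):
    # List-comprehension variant; `key in map(lambda f: f[0], opts._fields_)`
    # would work just as well here
    field_names = [name for name, _ in opts._fields_]
    for key, value in kwargs.items():
        if key not in field_names:
            raise TypeError("Invalid option: {}".format(key))
        setattr(opts, key, value)

opts = DemoOpts()
set_opts(opts, gpu_device_id=1)      # ok
# set_opts(opts, cuda_device_id=1)   # raises TypeError instead of being silently ignored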

@JBlaschke
Contributor Author

FTR: the list comprehension in your code might also be harder to optimize -- I know that python's map and filter functions are implemented in C. I suspect the list comprehension isn't -- it would be wonderful if it was. So while your changes don't make much of a difference here, if you had a really long list, then the map would be more efficient than the comprehension.

@janden
Collaborator

janden commented Jan 23, 2021

Eh for speed it depends (see https://stackoverflow.com/questions/1247486/list-comprehension-vs-map). In our case, we have

In [7]: %timeit 'foo' in [name for name, _ in opts._fields_]
883 ns ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit 'foo' in map(lambda x: x[0], opts._fields_)
1.74 µs ± 38.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For memory, I agree. Still, it's a one-time cost so I'd rather optimize for readability here. Although one could (I wouldn't) argue that map is more readable than a list comprehension…

janden and others added 7 commits January 23, 2021

  • Typo from before we changed the interface.
  • Let the user specify `gpu_device_id` when creating the plan.
  • This way if other libraries depend on the device ID staying the same, we won't interfere.
  • The destructor expects to see `self.plan` so let's make sure it's there in case something goes wrong.
  • Previously `setattr` was used, which does not error if you try to assign to a non-existent key. Instead it creates a new entry, which is not what we want.
  • If more than one GPU is available, tests the multi-GPU interface on all available GPUs.
  • If we provide an invalid option when constructing the plan, we should get an error so let's verify that.
@JBlaschke
Contributor Author

I don't want to start a long debate about the pros and cons of map and list comprehensions here. But the SO link is misleading because it skips over the fact that map performs way better when there are many elements to search over:

python -mtimeit -s'xs=range(1000)' '100 in map(lambda x: x+2, xs)'
20000 loops, best of 5: 10.2 usec per loop
python -mtimeit -s'xs=range(1000)' '100 in [x+2 for x in xs]'
5000 loops, best of 5: 64.4 usec per loop

This is because map doesn't iterate over the whole list -- and once we find our element, it stops applying the lambda. Anyway I just wanted to make sure nobody walks away feeling that map doesn't have a scaling advantage.
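
For the record, here is a tiny stand-alone check of that short-circuiting behaviour (nothing cufinufft-specific, just counting how often the mapped function gets called):

calls = 0

def plus_two(x):
    global calls
    calls += 1
    return x + 2

xs = range(1000)
print(100 in map(plus_two, xs))   # True
print(calls)                      # 99 -- map stopped applying plus_two as soon as 100 was produced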

Anyway, I'm not suggesting that you put map back.

@janden
Collaborator

janden commented Jan 23, 2021

Huh. That makes a lot of sense.

Anyhow I think we're in agreement. I've squashed and rebased (I hope you're not too offended that the list comprehension made it into your commit) and we're ready for merge once Jenkins passes.

@JBlaschke
Contributor Author

I've squashed and rebased (I hope you're not too offended that the list comprehension made it into your commit) and we're ready for merge once Jenkins passes.

🎉 Woohoo -- time to get a beer! (to celebrate the feature, not bemoan a short-lived map)

@janden
Collaborator

janden commented Jan 23, 2021

🎉 Woohoo -- time to get a beer! (to celebrate the feature, not bemoan a short-lived map)

Indeed. Have a good weekend!

janden merged commit db27003 into flatironinstitute:master Jan 23, 2021
@garrettwrong
Collaborator

One concern is that I think you might want to pass around contexts instead of device ids. Generally a context is/becomes associated with a device, so mostly equivalent to what was done... However, if some code (outside cufinufft) changes the context on you... and that can happen for a lot of reasons.... it really doesn't matter what devID is provided, the virtual memory space is not the same under different contexts. If you know the context you can potentially get back there... knowing the device is not enough (unless things have changed recently).

The setting device stuff is all happening inside some context. In sophisticated applications you might need to know both things, but I am going to assume the simplest 1process-1thread-1context-1device situation for us (our code). Otherwise, we're out of scope.

With that said, I actually am not sure we needed to do any device setting or resetting if the context is already created and bound to a specified device. (but it has been a little while...) That is, the cuda code changes might not have been required at all if the contexts/devices are all to be managed by calling code anyway, as it is being done. You would want this logic for contexts if you ever decide to change it.

That's the main corner case I think you might see trying to mix in with other potentially multi gpu code. You might wait to see if it is an issue for other users, as I do not think this affects the current work.

Worth noting the example unit test is not actually running concurrently on multiple gpus. It is sequentially using different gpus. To do the former would require multi threading or processing at some level. Should still be good enough to check most of the machinery.

Thanks for putting it together!

@JBlaschke
Contributor Author

JBlaschke commented Jan 25, 2021

Hi @garrettwrong -- that's a good observation, but it's overkill.

I think there might be a little bit of confusion here: cuda contexts are a concept that is only available to the cuda driver and are not available to the cuda runtime API. The application as it stands now uses the cuda runtime API and not the driver, so managing contexts is not possible (and probably not necessary).

My reading of the runtime API and driver spec is that the cuda runtime API manages its own context independently of the driver context stack (it has 1 context -- the primary context). This becomes apparent when you create a cuda driver context and set it to device N: the cuda runtime API calls (i.e. the ones prefixed with cuda instead of cu) will still use their default device.

That is to say, when you call cudaSetDevice, you're setting the device in the runtime API's context, and not touching any other driver contexts (other than the primary context). You can try this with a simple test case: set a device context to use device id=1 (using cuCtxCreate), then query the device using the cuda runtime (using cudaGetDevice). You'll find the runtime device ID is 0 => unaffected by cuCtxCreate.
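
For concreteness, here is a sketch of that experiment from python (assumptions: at least two GPUs, PyCUDA, and a loadable libcudart); it only prints whatever the runtime reports rather than asserting a particular answer:

import ctypes
import pycuda.driver as drv

libcudart = ctypes.CDLL("libcudart.so")     # CUDA runtime, called directly via ctypes

drv.init()
ctx = drv.Device(1).make_context()          # driver context bound to device 1

dev = ctypes.c_int(-1)
libcudart.cudaGetDevice(ctypes.byref(dev))  # ask the runtime which device it considers current
print("runtime's current device:", dev.value)

ctx.pop()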

The only caveat is when users are managing the device context used by the cuda runtime API (i.e. the primary context) -- then all of this falls apart. To be honest though, if a user is messing with the primary context, then they should be expected to know what they are doing. If you want to go down the road of manual context management, then this library has to be re-written to use the cu* (and not cuda*) functions. This is possible, but I think that would be overkill.

@JBlaschke
Contributor Author

For reference, have a look at this: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts

... it is best to avoid multiple contexts per GPU within the same CUDA application. To assist with this, the CUDA Driver API provides methods to access and manage a special context on each GPU called the primary context. These are the same contexts used implicitly by the CUDA Runtime when there is not already a current context for a thread.

@JBlaschke
Contributor Author

JBlaschke commented Jan 26, 2021

I realize that what I said might not be right when a context has already been created on the device ... I.e. when the user calls cuCtxCreate before any cuda* function. This seems very subtle, but the primary context only seems to be created if there isn't a context on the device already: https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html

There are lots of edge cases here that I think this version works for -- but the main usage models that I come across in multi-gpu applications are:

  1. The only contexts on the device are user-managed. I.e. the cuda runtime hasn't created its primary context before the user created their contexts.
  2. The user only employs the cuda runtime. I.e. the only contexts are the primary contexts on each device.

If the user uses something between 1 and 2 -- i.e. some components use the driver, while other components use the runtime -- then our approach should be safe:

  • we restore the device ID after running the cufinufft functions => we won't break the user's pre-existing contexts
  • we cudaMalloc when the plan/nupts are created and we cudaFree after the plan is destroyed => at least with responsible use, we won't pollute any user-managed contexts

Fun fact: I'm using cufinufft in an application that does mix and match 1 & 2 (PyCUDA components using the driver, and pybind11 components using the runtime API). So far the wheels haven't come off :)

The only place where I can see this falling apart is when the user changes context between plan creation and plan destruction -- e.g. in the python api, if we create a context anywhere between:

plan = cufinufft(2, shape, -1, tol, dtype=dtype, opts=opts, gpu_method=1, gpu_device_id=DEV_ID)
plan.set_pts(H_gpu, K_gpu, L_gpu)
plan.execute(nuvect_gpu, ugrid_gpu)
del plan  # or plan going out of scope

I want to point out though that using cufinufft that way presupposes the user is carefully managing memory anyway, as CUFINUFFT_SETPTS always mallocs (and that memory is assumed to be freed with the plan in CUFINUFFT_DESTROY) => setting new points without first destroying and re-creating the plan will cause a memory leak (this might be a good discussion for a different issue @MelodyShih @ahbarnett).

Anyway, I think even with the most irresponsible use, cufinufft won't have any side effects on existing device contexts. But I'll do some experiments on the interactions between the cuda runtime and the cuda driver to see if there are any odd edge cases. I would prefer not to schlepp around contexts though, as irresponsible use can easily lead to memory waste and performance issues.

Sorry if I'm creating lots of confusing noise here. I hope that this comment clarifies my previous point.

@garrettwrong
Collaborator

No worries about noise, you saved me some effort by responding to yourself with similar thoughts :). Thanks for the reference. I am unfortunately familiar; have been dealing with this since 2008. Besides cuda 4.0 and MPS, not much has changed :/. The runtime is just a layer on top of the driver. All the driver-y things are still happening, just hidden, wrapped up. It has historically been useful to initialize the driver before the runtime... that is happening in the unit test via pycuda...

I agree context management is overkill for your usage. However, I also sort of think that usage didn't require these changes either. There are many ways to bake an embarrassingly parallel data science problem to use multiple graphics cards. (We're not exactly doing anything non trivial with multiple cards here, which is what a lot of people might think of as multi-gpu functionality... pretty much any naive cuda code can run uncoupled on disparate gpus without (or with minimal) code changes... ). Regardless, I think this was a great addition to make this easier for your style of workflow, which is probably the most popular this code will see :).

I was going to point out that even in our own unit test, we are using driver contexts and device management via pycuda, in calling code ... but I see you noticed that already. Perhaps that makes my point about us not really controlling things. Similarly, all the nvidia libraries like cuBLAS and friends are passing around a handle (context). I wonder why that is.

If we encounter sophisticated use, the project may need to consider handling contexts. I'm not convinced that requires rewriting everything. I also don't think it is a value add today if the target user (you) is satisfied. I was just asked to look it over and consider if anything got my attention. Besides this, and potentially checking for CUDA errors/synchronization etc. around the device-changing lines, it looks good. FWIW, sometimes when you change devices, an operation which should be fine might catch someone else's failure. Also, if this code ever does any async transfers it may require some device sync in those device-changing areas for safety, but currently I see no async. Might be good for safety anyway, but syncs can be slow...

~g

@janden
Collaborator

janden commented Jan 26, 2021

Thank you for looking over this @garrettwrong.

I can't say that I'm following everything to 100%, but it seems reasonable to me that we don't want to be passing contexts to cufinufft for the reasons both of you outline.

What's not making sense to me is why we need to keep track of the device ID at all if this is already managed at a higher level. If the current context (selected using pycuda) is attached to a certain device, the plan will already be attached to that device, correct? (I'm thinking here of the code prior to merging the PR.) So when do we need to send the device ID to cufinufft?

One potential problem will arise if the user creates a context on a non-default device (that is, not device ID 0). If we create a plan without specifying the device ID, it will try to move the context to device 0 (if I understand correctly), which will result in some confusion, if not crashes. Am I missing something here? The code would be something like

drv.init()
dev = drv.Device(1)
ctx = dev.make_context()
plan = cufinufft(1, shape)

Due to the defaults, the plan creation will set gpu_device_id to 0, whereas (if I've understood correctly) the previous code would just go on with whatever device is attached to the current context (device ID 1).

@JBlaschke
Contributor Author

@garrettwrong @janden I think this is a really interesting discussion -- and it's important that we have it! So this is not about being right -- or calling one approach better in general. Even though I am partial to the runtime API (from what I can see it is more common in HPC than the driver).

One thing I would avoid is forcing the user to use the driver -- the internet has far more runtime API examples than driver examples => new users would probably gravitate towards the runtime. Eg. our application uses the CUDA driver only because PyCUDA uses the driver. I had half a mind to wrap the standard cuda* functions in pybind11 (Ideally NVIDIA would expose their cuda runtime API to python wink, wink), but opted for PyCUDA to handle our HtoD and DtoH data movement. This is the only thing we use PyCUDA for.

My experience with our application has been that I still need to cudaSetDevice even though I already did:

drv.init()
dev = drv.Device(1)
ctx = dev.make_context()

one possibility is that I can't guarantee that the driver init is called first => the runtime already has its own context. Before I can speak with authority on this, I want to run some experiments first.

Basically, I think it boils down to the fact that python currently has no good/easy [1] way to manage multiple devices. From a performance perspective, I think our objective should be for cufinufft to be as noninvasive as possible while allowing multi-gpu support without multiple contexts. In general, I think my adding the device ID to the cufinufft plan is a natural thing to do => the plan manages memory addresses, therefore the device ID is necessary to fully resolve those addresses.

Relating to all of this, @janden brings up a good point though: I was being super lazy about setting the default device to 0. Maybe the default should be: i) find the current context, ii) get the current device from (i). I chose gpu_device_id=0 as a default because it's nice and deterministic (and lazy). Maybe we should choose a clear strategy about how explicit we want the user to be -- eg: if you send the data to the "wrong" device, should cufinufft silently use that instead, or should we give the user a nice error.

Re: async transfers: aren't those attached to a cuda stream (which is owned by a device) => switching device won't disrupt the transfer as the owning stream persists on the correct device. So (when) we go to async, this should be ok -- right?

When I get a chance this week, I'll set up a bunch of multi-gpu mixed driver+runtime test cases, to try and capture these edge cases.

[1]: I know there's CuPy, but it also feels like overkill to add that (hefty) dependency in order to just manage devices.

@janden
Collaborator

janden commented Jan 27, 2021

one possibility is that I can't guarantee that the driver init is called first => the runtime already has its own context. Before I can speak with authority on this, I want to run some experiments first.

Ok so in that case the plan needs to have a device ID attached to it? I guess I'm viewing this narrowly through the lens of PyCuda as the high level, but perhaps there are other use cases where this is necessary.

Maybe the default should be: i) find the current context, ii) get the current device from (i). I chose gpu_device_id=0 as a default because it's nice and deterministic (and lazy). Maybe we should choose a clear strategy about how explicit we want the user to be -- eg: if you send the data to the "wrong" device, should cufinufft silently use that instead, or should we give the user a nice error.

Yes I think this would make sense. That way, the “dead stupid” approach of creating a context on a device and doing all your cufinufft work there will still work even if you forget to set gpu_device_id (just like the other PyCuda functions). Could you put together a PR for this?
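
For what it's worth, here is a rough sketch of what that default could look like on the python side (a hypothetical helper, not the merged code; it maps the current PyCuda context's device back to an ordinal by comparing PCI bus IDs):

import pycuda.driver as drv

def default_gpu_device_id():
    # Which device is the current (PyCUDA-managed) context bound to?
    current = drv.Context.get_device()
    bus_id = current.get_attribute(drv.device_attribute.PCI_BUS_ID)
    # Map it back to the device ordinal that gpu_device_id expects
    for ordinal in range(drv.Device.count()):
        if drv.Device(ordinal).get_attribute(drv.device_attribute.PCI_BUS_ID) == bus_id:
            return ordinal
    return 0  # fall back to the old default if nothing matches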

@janden
Collaborator

janden commented Feb 2, 2021

@JBlaschke Do you think you could put together a PR to fix the default device ID as discussed above? I can take a stab, otherwise.

@JBlaschke
Contributor Author

JBlaschke commented Feb 2, 2021

Hi @janden, sorry, I didn't notice your question about a PR until now. I am currently working on exactly that and will submit a PR either today or tomorrow. I'll also try and include some tests.

@JBlaschke
Contributor Author

@janden sorry for the delay, I have encountered a potential bug in cuFINUFFT that I had to investigate further. I'll make a PR for the bug and the default contexts soon though.

@JBlaschke
Contributor Author

@janden I've started working on this in #99 -- I think this distills our conversation from this PR, so I'm linking it here.

It's not quite done yet (I have to finish wrapping all the functions that use cudaSetDevice, and document what I found -- the cuda api + driver is quite the rabbit hole), but the rest is pretty routine (I hope).
