Add basic multi-gpu support #71
Conversation
Please let me know if you need anything more complex. We're controlling our workflow from Python => I wanted to keep this PR as minimal as possible to not cause any problems upstream.
Thank you for putting this together. I'm sorry we haven't gotten around to reviewing this. The past two weeks have been very busy on my end, but I'll try to get through this sometime next week.
@janden No worries! I have the same problem (big project review overlapping with SC20). Let me know if you have any questions.
This looks pretty straightforward. My main worry is about state. If we go around calling `cudaSetDevice`, is that going to mess up other CUDA libraries? Or is it up to each library to make sure it's operating on the right device?
Feel free to make those changes (you should have write access to my repo) -- or let me know and I'll implement them myself @janden
Ok, I'll make the changes. What about …
You mean like we do here 6071e25#diff-381d9fc71fd414e7c4ba1a28764788aa8aa95c5f971797d9219da10e213b9400R104? Short answer: yes. Longer answer: I might be using it slightly too much (it's fine to call it once per process), so we can call it during plan creation only. But I am a little paranoid about these things, so I tend to want to call it every time I enter a library call that needs it.
... the situation that I am paranoid about is if 2 libraries set it to different devices. The way it is now, all functions called by the Python API would address the correct device. Something to think about is to query the device ID and reset it at the end of the top-level function calls. This way we won't be interfering with libraries that are not as paranoid.
Right, that last part is my concern. Is the default convention that all libraries should be paranoid, or should they be courteous and reset the device ID after they're done?
I was initially worried about performance. Now that I've measured it, I see that … I.e., libraries should not change the state of the device for any other libraries. I suggest we wrap all the functions that set the device as follows:
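(The snippet originally posted here is not preserved; below is a sketch of the save/set/restore idea, expressed as a Python context manager over the CUDA runtime via ctypes rather than cufinufft's own C/CUDA wrapping, which is an assumption on my part.)

```python
# Sketch only: remember the caller's current device, switch to the requested
# one for the duration of a library call, and restore the original afterwards.
import ctypes
from contextlib import contextmanager

_cudart = ctypes.CDLL("libcudart.so")  # assumes a Linux CUDA installation

@contextmanager
def device_guard(device_id):
    prev = ctypes.c_int(0)
    _cudart.cudaGetDevice(ctypes.byref(prev))        # remember caller's device
    _cudart.cudaSetDevice(ctypes.c_int(device_id))   # switch to the plan's device
    try:
        yield
    finally:
        _cudart.cudaSetDevice(prev)                  # restore it on the way out
```

Wrapping each top-level call in `device_guard(...)` leaves the caller's current device untouched afterwards, which is the courteous behaviour discussed above.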
Sounds wonderful. Please try the new interface and see how it fits with your code. If you can, it would also be a good idea to run …
Will do -- just waiting for my Cori GPU allocation for testing.
Just to confirm @janden, you haven't made the changes to the options API elsewhere?
I rebased onto the new master (which includes the changes from #78). As a result, you no longer specify the …
Thanks!
Hey @janden -- just wanted to quickly check with you that, even though I'm doing this: `plan = cufinufft(2, shape, -1, tol, dtype=dtype, gpu_method=1, gpu_device_id=DEV_ID)`, printf shows that `opts->gpu_device_id` stays as 0 in the cuda code -- am I missing something here?
Ok, found the problem! It was me (quelle surprise!). I had a …
`cufinufft(2, shape, -1, ...` -- what is this 3rd arg, -1? The culprit?
hrm... my latest commit wasn't right either ... lemme try something else
ok -- i always feel proud if i can use …
@janden this version works on Cori GPU -- I tested with 16 ranks on 8 GPUs
Great! (Can I ask a related question, which is why >1 rank per GPU is expected to squeeze out any more performance than 1 rank per GPU? Don't codes use all the GPU cores? Maybe there are some codes that don't ...)
Oh @ahbarnett, that's mainly because a common strategy for latency hiding is for one thread to launch kernels while another does memcpy -- in Python true multi-threading is a bit of a can of worms, so a common strategy is to have …
Now I'm confused. I thought …
Oh I see now. It's a ctypes thing. The variable …
In Python, the following:

```python
class Basic(object):
    pass

b = Basic()
setattr(b, "z", 1)
print(b.__dict__)
```

this will always output `{'z': 1}`.
The … Your way here: 918d45d#diff-55a88bcb0ebecf4899b36b07fdea411828b2ac581559f49dd7d494b8896940f6R108 does essentially the same thing at the cost of extra memory.
FTR: the list comprehension in your code might also be harder to optimize -- I know that Python's …
Eh, for speed it depends (see https://stackoverflow.com/questions/1247486/list-comprehension-vs-map). In our case, we have …
For memory, I agree. Still, it's a one-time cost so I'd rather optimize for readability here. Although one could (I wouldn't) argue that …
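For concreteness, an illustrative comparison of the two idioms being discussed (hypothetical data, not the actual cufinufft code):

```python
# Illustrative only: in Python 3, map() returns a lazy iterator, while the
# list comprehension materializes the whole list up front (the "extra memory"
# point above), arguably in exchange for readability.
names = ["gpu_method", "gpu_device_id", "gpu_sort"]

lazy = map(str.upper, names)               # evaluated on demand
eager = [name.upper() for name in names]   # built immediately

assert list(lazy) == eager                 # same result either way
```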
Typo from before we changed the interface.
Let the user specify `gpu_device_id` when creating the plan.
This way if other libraries depend on the device ID staying the same, we won't interfere.
The destructor expects to see `self.plan` so let's make sure it's there in case something goes wrong.
Previously `setattr` was used, which does not error if you try to assign to a non-existent key. Instead it creates a new entry, which is not what we want.
If more than one GPU is available, tests the multi-GPU interface on all available GPUs.
If we provide an invalid option when constructing the plan, we should get an error so let's verify that.
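A minimal sketch of the option validation described in these notes, assuming a hypothetical ctypes struct in place of the real `cufinufft_opts` (field and function names here are illustrative):

```python
# Reject unknown option names instead of letting setattr silently attach a new
# Python attribute that the underlying C struct never sees.
import ctypes

class Opts(ctypes.Structure):  # hypothetical stand-in for cufinufft_opts
    _fields_ = [("gpu_method", ctypes.c_int),
                ("gpu_device_id", ctypes.c_int)]

def apply_opts(opts, **kwargs):
    valid = {name for name, _ in opts._fields_}
    for key, value in kwargs.items():
        if key not in valid:
            raise TypeError(f"Invalid option: {key}")
        setattr(opts, key, value)

opts = Opts()
apply_opts(opts, gpu_device_id=1)     # ok
# apply_opts(opts, gpu_devide_id=1)   # typo -> raises TypeError
```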
I don't want to start a long debate about the pros and cons of …
This is because … Anyway, I'm not suggesting that you put …
Huh. That makes a lot of sense. Anyhow, I think we're in agreement. I've squashed and rebased (I hope you're not too offended that the list comprehension made it into your commit) and we're ready for merge once Jenkins passes.
🎉 Woohoo -- time to get a beer! (to celebrate the feature, not bemoan a short-lived map)
Indeed. Have a good weekend!
One concern is that I think you might want to pass around contexts instead of device ids. Generally a context is/becomes associated with a device, so mostly equivalent to what was done... However, if some code (outside cufinufft) changes the context on you... and that can happen for a lot of reasons... it really doesn't matter what devID is provided, the virtual memory space is not the same under different contexts. If you know the context you can potentially get back there... knowing the device is not enough (unless things have changed recently).

The setting device stuff is all happening inside some context. In sophisticated applications you might need to know both things, but I am going to assume the simplest 1process-1thread-1context-1device situation for us (our code). Otherwise, we're out of scope.

With that said, I actually am not sure we needed to do any device setting or resetting if the context is already created and bound to a specified device (but it has been a little while...). That is, the cuda code changes might not have been required at all if the contexts/devices are all to be managed by calling code anyway, as it is being done. You would want this logic for contexts if you ever decide to change it. That's the main corner case I think you might see trying to mix in with other potentially multi-gpu code. You might wait to see if it is an issue for other users, as I do not think this affects the current work.

Worth noting the example unit test is not actually running concurrently on multiple gpus. It is sequentially using different gpus. To do the former would require multi-threading or processing at some level. Should still be good enough to check most of the machinery. Thanks for putting it together!
Hi @garrettwrong -- that's a good observation, but it's overkill. I think there might be a little bit of confusion here: cuda contexts are a concept that is only available to the cuda driver and are not available to the cuda runtime API. The application as it stands now uses the cuda runtime API and not the driver, so managing contexts is not possible (and probably not necessary).

My reading of the runtime API and driver spec is that the cuda runtime API manages its own context independently of the driver context stack (it has 1 context -- the primary context). This is apparent when you create a cuda driver context and set this to device … That is to say, when you call …

The only caveat is where users are managing the device context used by the cuda runtime API (i.e. the primary context); then all of this falls apart. To be honest though, if a user is messing with the primary context, then they should be expected to know what they are doing. If you want to go down the road of manual context management, then this library has to be re-written to use the driver API.
For reference, have a look at this: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
I realize that what I said might not be right when a context has already been created on the device ... i.e. when the user calls … There are lots of edge cases here that I think this version works for -- but the main usage models that I come across in multi-gpu applications are:

1. the user manages devices entirely through the cuda driver (e.g. via PyCUDA contexts), or
2. the user manages devices entirely through the cuda runtime API.
If the user uses something between 1 and 2 -- i.e. some components use the driver, while other components use the runtime -- then our approach should be safe: …
Fun fact: I'm using cufinufft in an application that does mix and match 1 & 2 (PyCUDA components using the driver, and pybind11 components using the runtime API). So far the wheels haven't come off :) The only place where I can see this falling apart is when the user changes context between plan creation and plan destruction -- e.g. in the Python API if we create a context anywhere between:

```python
plan = cufinufft(2, shape, -1, tol, dtype=dtype, opts=opts, gpu_method=1, gpu_device_id=DEV_ID)
plan.set_pts(H_gpu, K_gpu, L_gpu)
plan.execute(nuvect_gpu, ugrid_gpu)
del plan  # or plan going out of scope
```

I want to point out though that using … Anyway, I think even with most irresponsible use … Sorry if I'm creating lots of confusing noise here. I hope that this comment clarifies my previous point.
No worries about noise, you saved me some effort by responding to yourself with similar thoughts :). Thanks for the reference. I am unfortunately familiar; I have been dealing with this since 2008. Besides cuda 4.0 and MPS, not much has changed :/. The runtime is just a layer on top of the driver. All the driver-y things are still happening, just hidden, wrapped up. It has historically been useful to initialize the driver before the runtime... that is happening in the unit test via pycuda...

I agree context management is overkill for your usage. However, I also sort of think that usage didn't require these changes either. There are many ways to bake an embarrassingly parallel data science problem to use multiple graphics cards. (We're not exactly doing anything non-trivial with multiple cards here, which is what a lot of people might think of as multi-gpu functionality... pretty much any naive cuda code can run uncoupled on disparate gpus without (or with minimal) code changes...). Regardless, I think this was a great addition to make this easier for your style of workflow, which is probably the most popular this code will see :).

I was going to point out that even in our own unit test, we are using driver contexts and device management via pycuda, in calling code ... but I see you noticed that already. Perhaps that makes my point about us not really controlling things. Similarly, all the nvidia libraries like cuBLAS and friends are passing around a handle (context). I wonder why that is.

If we encounter sophisticated use, the project may need to consider handling contexts. I'm not convinced that requires rewriting everything. I also don't think it is a value add today if the target user (you) is satisfied. I was just asked to look it over and consider if anything got my attention. Besides this, and potentially checking for CUDA errors/synchronization etc. around the device-changing lines, looks good.

FWIW, sometimes when you change devices etc., an operation which should be fine might catch someone else's fail. Also, if this code ever does any async transfers, it may require some device sync in those dev-changing areas for safety, but currently I see no async. Might be good for safety anyway, but they can be slow...

~g
Thank you for looking over this @garrettwrong. I can't say that I'm following everything to 100%, but it seems reasonable to me that we don't want to be passing contexts to cufinufft for the reasons both of you outline.

What's not making sense to me is why we need to keep track of the device ID at all if this is already managed at a higher level. If the current context (selected using pycuda) is attached to a certain device, the plan will already be attached to that device, correct? (I'm thinking here of the code prior to merging the PR.) So when do we need to send the device ID to cufinufft?

One potential problem will arise if the user creates a context on a non-default device (that is, not device ID 0). If we create a plan without specifying the device ID, it will try to move the context to device 0 (if I understand correctly), which will result in some confusion, if not crashes. Am I missing something here? The code would be something like …
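The snippet is not preserved here; the following is an illustrative sketch of that scenario, assuming the PyCUDA and cufinufft Python APIs used elsewhere in this thread and hypothetical values for `shape` and `tol`:

```python
import numpy as np
import pycuda.driver as drv
from cufinufft import cufinufft

drv.init()
ctx = drv.Device(1).make_context()   # the user's context lives on device 1

shape, tol = (16, 16, 16), 1e-6
plan = cufinufft(2, shape, -1, tol, dtype=np.float32)  # no gpu_device_id given
# with a hard-coded default of 0, the plan targets device 0 even though the
# user's arrays live under the context on device 1
```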
Due to the defaults, the plan creation will set …
@garrettwrong @janden I think this is a really interesting discussion -- and it's important that we have it! So this is not about being right -- or calling one approach better in general. Even though I am partial to the runtime API (from what I can see it is more common in HPC than the driver).

One thing I would avoid is forcing the user to use the driver -- the internet has far more runtime API examples than driver examples => new users would probably gravitate towards the runtime. E.g. our application uses the CUDA driver only because PyCUDA uses the driver. I had half a mind to wrap the standard … My experience with our application has been that I still need to …

```python
drv.init()
dev = drv.Device(1)
ctx = dev.make_context()
```

One possibility is that I can't guarantee that the driver init is called first => the runtime already has its own context. Before I can speak with authority on this, I want to run some experiments first.

Basically, I think it boils down to the fact that python currently has no good/easy [1] way to manage multiple devices. From a performance perspective, I think our objective should be for cufinufft to be as noninvasive as possible while allowing multi-gpu support without multiple contexts. In general, I think my adding the device ID to the cufinufft plan is a natural thing to do => the plan manages memory addresses, therefore the device ID is necessary to fully resolve those addresses.

Relating to all of this, @janden brings up a good point though: I was being super lazy about setting the default device to 0. Maybe the default should be: i) find the current context, ii) get the current device from (i). I chose …

Re: async transfers: aren't those attached to a cuda stream (which is owned by a device) => switching device won't disrupt the transfer as the owning stream persists on the correct device. So (when) we go to async, this should be ok -- right?

When I get a chance this week, I'll set up a bunch of multi-gpu mixed driver+runtime test cases, to try and capture these edge cases.

[1]: I know there's CuPy, but it also feels like overkill to add that (hefty) dependency in order to just manage devices.
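A sketch of the proposed default from (i)/(ii) above -- an assumption, not cufinufft's actual code -- querying the runtime for the caller's current device via ctypes:

```python
# Instead of hard-coding device 0, ask the runtime which device the caller's
# current context is bound to and use that as the default gpu_device_id.
import ctypes

_cudart = ctypes.CDLL("libcudart.so")  # assumes a Linux CUDA installation

def default_device_id():
    dev = ctypes.c_int(0)
    err = _cudart.cudaGetDevice(ctypes.byref(dev))
    if err != 0:
        raise RuntimeError(f"cudaGetDevice failed with error code {err}")
    return dev.value
```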
Ok, so in that case the plan needs to have a device ID attached to it? I guess I'm viewing this narrowly through the lens of PyCUDA as the high level, but perhaps there are other use cases where this is necessary.
Yes, I think this would make sense. That way, the “dead stupid” approach of creating a context on a device and doing all your cufinufft work there will still work even if you forget to set `gpu_device_id`.
@JBlaschke Do you think you could put together a PR to fix the default device ID as discussed above? I can take a stab otherwise.
Hi @janden, sorry, I didn't notice your question about a PR until now. I am currently working on exactly that and will submit a PR either today or tomorrow. I'll also try and include some tests.
@janden sorry for the delay, I have encountered a potential bug in cuFINUFFT that I had to investigate further. I'll make a PR for the bug and the default contexts soon though.
@janden I've started working on this in #99 -- I think this distills our conversation from this PR, so I'm linking it here. It's not quite done yet (I have to finish wrapping all the functions that use …)
Following the discussion in #67, I implemented some basic multi-GPU support. The strategy with cufinufft has been to:

- add a `cuda_device_id` to the `cufinufft_opts` struct
- have the `cufinufft` functions that either take the options (or a copy of the options in the plan) respect this setting: `CUFINUFFT_MAKEPLAN`, `CUFINUFFT_SETPTS`, `CUFINUFFT_EXECUTE`, `CUFINUFFT_DESTROY`, and the helpers in `memtransfer_wrapper.cu`

In theory, we can enable more fine-grained multi-GPU control, but this is more than enough for data-parallel workflows (the `cudaSetDevice` calls in `memtransfer_wrapper.cu` are probably overkill -- but I tend not to take chances with memory). Here is an example of a data-parallel workflow in Python (see the sketch below). I can package some basic test scripts based on this workflow if all y'all are interested, but I wanted to solicit some feedback first.
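(The example originally included with this description is not preserved; the following is an illustrative sketch, with the mpi4py-based rank-to-GPU mapping as an assumption. Each MPI rank binds to one GPU and runs its own independent cufinufft plan there.)

```python
import numpy as np
from mpi4py import MPI
from cufinufft import cufinufft

comm = MPI.COMM_WORLD
n_gpus = 8                              # GPUs per node, e.g. one Cori GPU node
dev_id = comm.Get_rank() % n_gpus       # several ranks may share a GPU

shape, tol = (128, 128, 128), 1e-6
plan = cufinufft(2, shape, -1, tol, dtype=np.float32,
                 gpu_method=1, gpu_device_id=dev_id)

# each rank now calls plan.set_pts(...) / plan.execute(...) on its own slice
# of the data, entirely independently of the other ranks
```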
This works nicely on NERSC's Cori-GPU. Gonna try OLCF Summit next, and maybe our DGX cluster.