chore(gpu): structure to encapsulate streams #2725
Conversation
Force-pushed from a8d639d to bcb82e1
Force-pushed from 18a73c5 to 58f7e88
Is this ready for review or should it have been a draft PR?

Asking since the PR says POC.

@IceTDrinker not ready for review yet, CI is red.

I understand that :) just wondering if this needs a review down the line as it's labelled POC.

It will 😉 we'll change the PR title once it's ready. I think it'll be mostly for me to review as it's a low-level change in the Cuda backend.
Force-pushed from 58f7e88 to abc5544
Force-pushed from abc5544 to 3c3ffb1
I don't know if we want to attach this set of streams to the LUT, it is actually the one that comes from Rust. Is there a need to have it here really? If possible we should probably not have any.
I don't think we need this new Makefile target, at least we don't have this on CPU so not sure we want to add it cc @IceTDrinker
So I would not name this assign_clone; what this function does here is create a new set of streams on the same subset of GPUs. Probably create_from would be better?
Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)
streams_clone -> new_streams?
It's ok to use streams here and restrict the number of GPUs to active_gpu_count
Do we need to keep track of active_gpu_count in the int_noise_squashing_lut struct? I don't think so; this one we could remove altogether.
Previously, agnesLeroy (Agnès Leroy) wrote…
The only thing we need to keep in int_radix_lut is active_gpu_count because it's used in the release function.
Previously, agnesLeroy (Agnès Leroy) wrote…
We could even get rid of this by looping over pbs_buffer.size() in the release function
How about looping over pbs_buffer.size() to avoid having to store active_gpu_count?
It's ok if we do streams.synchronize(); here, it shouldn't impact performance at all.
Previously, agnesLeroy (Agnès Leroy) wrote…
We could define auto active_gpu_count = pbs_buffer.size();
And here we could keep the loop over active_gpu_count
@agnesLeroy reviewed 2 of 66 files at r1, 2 of 3 files at r2, all commit messages.
Reviewable status: 4 of 66 files reviewed, 26 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):
lut_indexes, new_lut_indexes, new_num_blocks * sizeof(Torus), streams.stream(0), streams.gpu_index(0), gpu_memory_allocated); auto new_active_gpus = streams.active_gpu_subset(new_num_blocks);
new_active_gpus -> active_streams?
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2513 at r2 (raw file):
gpu_memory_allocated); auto active_gpus = streams.active_gpu_subset(1);
For this variable name I'd suggest
active_gpus -> active_streams
(applies everywhere)
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2831 at r2 (raw file):
int_radix_params params; CudaStreams active_gpus;
active_gpu_count was used to track how many sub streams and events are created, I think we should keep the logic and only store the number of active gpus for substreams & events here.
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2922 at r2 (raw file):
outgoing_events2 = (cudaEvent_t *)malloc(active_gpus.count() * sizeof(cudaEvent_t)); for (uint j = 0; j < active_gpus.count(); j++) {
Again, you can loop over active_gpu_count and use streams as a variable below
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2979 at r2 (raw file):
CudaStreams true_streams; CudaStreams false_streams; CudaStreams active_gpus;
Here there are no events, so we could get rid of active_gpu_count in the struct altogether; no need to store active_gpus.
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3381 at r2 (raw file):
CudaStreams local_streams_1; CudaStreams local_streams_2; CudaStreams active_gpus;
You can remove this from the struct, no events so no need to keep track of active_gpu_count in the struct anymore
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):
2 * num_radix_blocks * sizeof(Torus), streams.stream(0), streams.gpu_index(0), allocate_gpu_memory); auto active_gpus_pred = streams.active_gpu_subset(2 * num_radix_blocks);
active_gpus -> active_streams
(everywhere)
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4017 at r2 (raw file):
CudaStreams lsb_streams; CudaStreams msb_streams; CudaStreams active_gpus;
Same here you can remove it from the struct
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4223 at r2 (raw file):
template <typename Torus> struct unsigned_int_div_rem_memory { int_radix_params params; CudaStreams active_gpus;
Again, here there are no events, so no need to keep this here.
Ok, overall it's starting to look really good, thanks a lot for the changes @andrei-stoian-zama
Reviewable status: 4 of 66 files reviewed, 28 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)
backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):
PANIC("Cuda error: number of gpus in scatter should be <= number of gpus " "used to create the lut") dest.resize(streams.count());
So isn't it possible to remove this dest.resize()?
tfhe/src/high_level_api/integers/signed/tests/gpu.rs line 159 at r2 (raw file):
let gt_tmp_buffer_size = a.get_gt_size_on_gpu(b); println!("1");
Need to remove all the println from here
For the core part there isn't much to say so I guess ok like this
@IceTDrinker reviewed 3 of 66 files at r1, 1 of 3 files at r2.
Reviewable status: 7 of 66 files reviewed, 28 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)
Makefile line 1001 at r2 (raw file):
Previously, agnesLeroy (Agnès Leroy) wrote…
I don't think we need this new Makefile target, at least we don't have this on CPU so not sure we want to add it cc @IceTDrinker
Targets don't have to be useful in CI; if it's a commonly used command it can be ok, to be honest.
Force-pushed from 6455f02 to 3ced713
Still some minor changes, otherwise it looks really good. We should run the 64-bit latency / throughput benchmarks just to check that performance is not affected
@agnesLeroy reviewed 50 of 66 files at r1, 14 of 14 files at r3, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)
backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r2 (raw file):
Previously, agnesLeroy (Agnès Leroy) wrote…
Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)
There are still some to be replaced, it seems
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):
Previously, agnesLeroy (Agnès Leroy) wrote…
new_active_gpus -> active_streams?
Still some old naming new_active_gpus, should be new_active_streams
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):
Previously, agnesLeroy (Agnès Leroy) wrote…
active_gpus -> active_streams
(everywhere)
Some more old naming, also for the one below. That's why I use sed 😆
backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):
Previously, agnesLeroy (Agnès Leroy) wrote…
So isn't it possible to remove thise dest.resize()?
I have this change in another PR, probably better not to do it here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 347 at r3 (raw file):
/////////////// for (uint i = 0; i < active_streams.count(); i++) { cuda_set_device(streams.gpu_index(i));
Here it would probably be better to use active_streams instead of streams
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 355 at r3 (raw file):
uint64_t size = 0; execute_scratch_pbs<Torus>( streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 368 at r3 (raw file):
// We create the events only if we have multiple GPUs if (active_streams.count() > 1) { event_scatter_in = cuda_create_event(streams.gpu_index(0));
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 369 at r3 (raw file):
if (active_streams.count() > 1) { event_scatter_in = cuda_create_event(streams.gpu_index(0)); event_broadcast = cuda_create_event(streams.gpu_index(0));
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 383 at r3 (raw file):
for (uint i = 0; i < active_streams.count(); i++) { auto lut = (Torus *)cuda_malloc_with_size_tracking_async( num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 386 at r3 (raw file):
size_tracker, allocate_gpu_memory); auto lut_indexes = (Torus *)cuda_malloc_with_size_tracking_async( lut_indexes_size, streams.stream(i), streams.gpu_index(i),
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 391 at r3 (raw file):
// if a different behavior is wanted, it should be rewritten later cuda_memset_with_size_tracking_async( lut_indexes, 0, lut_indexes_size, streams.stream(i),
Same here
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 443 at r3 (raw file):
allocate_gpu_memory); cuda_synchronize_stream(streams.stream(0), streams.gpu_index(0)); multi_gpu_copy_array_async(active_streams, lwe_trivial_indexes_vec,
Why is active_streams used here and not in the allocation functions? It would make more sense to use it for the multi gpu allocs as well no?
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 560 at r3 (raw file):
active_streams = streams.active_gpu_subset(num_radix_blocks); for (uint i = 0; i < active_streams.count(); i++) { cuda_set_device(streams.gpu_index(i));
streams -> active_streams?
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 568 at r3 (raw file):
uint64_t size = 0; execute_scratch_pbs<Torus>( streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,
Again
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 580 at r3 (raw file):
// We create the events only if we have multiple GPUs if (active_streams.count() > 1) { event_scatter_in = cuda_create_event(streams.gpu_index(0));
Again and the one below
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 594 at r3 (raw file):
for (uint i = 0; i < active_streams.count(); i++) { auto lut = (Torus *)cuda_malloc_with_size_tracking_async( num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),
Again and the two below
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 775 at r3 (raw file):
void release(CudaStreams streams) { GPU_ASSERT(lut_indexes_vec.size() == lut_vec.size(),
PANIC_IF_FALSE would be better I think
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 827 at r3 (raw file):
multi_gpu_release_async(streams, lwe_after_pbs_vec); multi_gpu_release_async(streams, lwe_trivial_indexes_vec); streams.synchronize();
I think we could remove this streams.synchronize; we don't need it, as we synchronize on the active streams below. This is some legacy synchronization I had introduced because I wasn't sure whether calling clear without synchronizing first would be ok, but it is, as the GPU pointers have their own "life" on the GPU. Once the drop instruction is issued from the CPU, it's fine to clear.
backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1054 at r3 (raw file):
multi_gpu_release_async(streams, lwe_after_pbs_vec); multi_gpu_release_async(streams, lwe_trivial_indexes_vec); streams.synchronize();
Again, I don't think we need to synchronize here. Maybe we could keep these two synchronize calls just to avoid introducing a bad bug in this PR, and try to remove them in another PR, to be safer.
Force-pushed from 3ced713 to 7e14941
Still a GPU_ASSERT here.

In Reviewable, if I select r4, which I would guess is the latest version, it seems it's not there anymore.
Ok looks all good, let's address the comments about using active_streams and removing some synchronize in another PR 👍
Encapsulate stream handling in the FFI and the backend