Conversation

@andrei-stoian-zama
Contributor

@andrei-stoian-zama andrei-stoian-zama commented Sep 3, 2025

Encapsulate stream handling in the FFI and the backend


This change is Reviewable

@cla-bot cla-bot bot added the cla-signed label Sep 3, 2025
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/struct_for_gpu_indices branch 5 times, most recently from a8d639d to bcb82e1 Compare September 11, 2025 12:41
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/struct_for_gpu_indices branch 2 times, most recently from 18a73c5 to 58f7e88 Compare September 11, 2025 19:21
@zama-bot zama-bot removed the approved label Sep 11, 2025
@andrei-stoian-zama andrei-stoian-zama marked this pull request as ready for review September 11, 2025 19:28
@IceTDrinker
Member

Is this ready for review or should it have been a draft PR ?

@IceTDrinker
Member

Asking since the PR says POC

@agnesLeroy
Contributor

@IceTDrinker not ready for review yet, CI is red.

@IceTDrinker
Member

I understand that :) just wondering if this needs a review down the line as it's labelled POC

@agnesLeroy
Contributor

It will 😉 we'll change the PR title once it's ready. I think it'll be mostly for me to review as it's a low-level change in the Cuda backend.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

                uint32_t num_radix_blocks, bool allocate_gpu_memory,
                uint64_t &size_tracker)
      : residence_streams(streams) {

I don't know if we want to attach this set of streams to the LUT; it is actually the one that comes from Rust. Is there really a need to have it here? If possible we should probably not have any.

@agnesLeroy
Contributor

Makefile line 1001 at r2 (raw file):

	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) nextest list --cargo-profile $(CARGO_PROFILE) \
		--features=integer,internal-keycache,gpu,zk-pok -p tfhe \
		-E "test(/.*gpu.*/)"

I don't think we need this new Makefile target; at least we don't have this on CPU, so I'm not sure we want to add it cc @IceTDrinker

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 113 at r2 (raw file):

  // set Can be used to parallelize computation by issuing kernels on multiple
  // streams on the same GPU
  void assign_clone(const CudaStreams &other) {

So I would not name this assign_clone; what this function does here is create a new set of streams on the same subset of GPUs. Probably create_from would be better?
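
For illustration, here's a minimal standalone sketch of the suggested semantics, using a hypothetical StreamSet type and plain CUDA runtime calls rather than the backend's CudaStreams helpers:

    #include <cuda_runtime.h>
    #include <vector>

    struct StreamSet {
      std::vector<cudaStream_t> streams;
      std::vector<int> gpu_indexes;

      // Create a fresh set of streams on the same GPU indices as `other`.
      void create_from(const StreamSet &other) {
        for (int gpu : other.gpu_indexes) {
          cudaSetDevice(gpu);
          cudaStream_t s;
          cudaStreamCreate(&s);
          streams.push_back(s);
          gpu_indexes.push_back(gpu);
        }
      }
    };

This is only meant to show the "new streams on the same GPUs" intent, not the actual CudaStreams implementation.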

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r2 (raw file):

  // streams on the same GPU
  void assign_clone(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 116 at r2 (raw file):

    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

    cudaStream_t *streams_clone = new cudaStream_t[other._gpu_count];

streams_clone -> new_streams?

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 837 at r2 (raw file):

      lwe_trivial_indexes_vec.clear();
      if (active_gpus.count() > 1) {
        active_gpus.synchronize();

It's ok to use streams here and restrict the number of GPUs to active_gpu_count

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 923 at r2 (raw file):

    ///////////////
    active_gpus = streams.active_gpu_subset(num_radix_blocks);

Do we need to keep track of active_gpu_count in the int_noise_squashing_lut struct? I don't think so; this one we could remove altogether.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

Actually right now there are two:

  • residence_streams which are the ones coming from Rust
  • active_gpus which is a subset depending on the number of blocks

This seems too complicated: how about getting rid of the two? When calling the constructor that does mem_reuse the set of streams should be the one coming from Rust. I think that's all we need to do. Having a set of streams encapsulated in the lut makes it more complicated to follow what's going on I think: given that we always recompute the active gpu count when broadcasting or calling execute_keyswitch / pbs this should work.

The only thing we need to keep in int_radix_lut is active_gpu_count because it's used in the release function.
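
For illustration, a minimal sketch of recomputing the active subset on demand; this assumes the active count is simply the minimum of the radix block count and the available GPUs (hypothetical helper, not necessarily the backend's actual active_gpu_subset logic):

    #include <algorithm>
    #include <cstdint>

    // Hypothetical helper: how many GPUs are worth using for a given number
    // of radix blocks (assumes one block is the smallest unit of work).
    uint32_t active_gpu_count_for(uint32_t num_radix_blocks, uint32_t total_gpus) {
      return std::min(num_radix_blocks, total_gpus);
    }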

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

The only thing we need to keep in int_radix_lut is active_gpu_count because it's used in the release function.

We could even get rid of this by looping over pbs_buffer.size() in the release function

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1017 at r2 (raw file):

  }
  void release(CudaStreams streams) {
    for (uint i = 0; i < active_gpus.count(); i++) {

How about looping over pbs_buffer.size() to avoid having to store active_gpu_count?

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1057 at r2 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    active_gpus.synchronize();

It's ok if we do streams.synchronize(); here, as it shouldn't impact performance at all.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1017 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

How about looping over pbs_buffer.size() to avoid having to store active_gpu_count?

We could define

auto active_gpu_count = pbs_buffer.size();
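
For illustration, a standalone sketch of the idea with a hypothetical struct (the real release also frees the per-GPU buffers and index vectors):

    #include <cstddef>
    #include <vector>

    struct lut_buffers {
      std::vector<void *> pbs_buffer; // one entry per active GPU

      void release() {
        // No stored active_gpu_count: the vector size already encodes it.
        auto active_gpu_count = pbs_buffer.size();
        for (std::size_t i = 0; i < active_gpu_count; i++) {
          // free pbs_buffer[i] on its GPU here (backend-specific call)
        }
        pbs_buffer.clear();
      }
    };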

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1057 at r2 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    active_gpus.synchronize();

And here we could keep the loop over active_gpu_count

Contributor

@agnesLeroy agnesLeroy left a comment

@agnesLeroy reviewed 2 of 66 files at r1, 2 of 3 files at r2, all commit messages.
Reviewable status: 4 of 66 files reviewed, 26 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):

        lut_indexes, new_lut_indexes, new_num_blocks * sizeof(Torus),
        streams.stream(0), streams.gpu_index(0), gpu_memory_allocated);
    auto new_active_gpus = streams.active_gpu_subset(new_num_blocks);

new_active_gpus -> active_streams?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2513 at r2 (raw file):

          gpu_memory_allocated);

      auto active_gpus = streams.active_gpu_subset(1);

For this variable name I'd suggest
active_gpus -> active_streams
(applies everywhere)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2831 at r2 (raw file):

  int_radix_params params;

  CudaStreams active_gpus;

active_gpu_count was used to track how many sub-streams and events are created; I think we should keep that logic and only store the number of active GPUs for the sub-streams & events here.
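
For illustration, a standalone sketch of keeping only the count for the event bookkeeping, using a hypothetical struct and plain CUDA runtime calls:

    #include <cstdint>
    #include <cuda_runtime.h>
    #include <vector>

    struct sub_stream_events {
      uint32_t active_gpu_count = 0;
      std::vector<cudaEvent_t> incoming_events;

      void create(uint32_t count) {
        // Remember how many events were created so release() matches exactly.
        active_gpu_count = count;
        incoming_events.resize(count);
        for (uint32_t i = 0; i < count; i++)
          cudaEventCreate(&incoming_events[i]);
      }

      void release() {
        for (uint32_t i = 0; i < active_gpu_count; i++)
          cudaEventDestroy(incoming_events[i]);
        incoming_events.clear();
        active_gpu_count = 0;
      }
    };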


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2922 at r2 (raw file):

    outgoing_events2 =
        (cudaEvent_t *)malloc(active_gpus.count() * sizeof(cudaEvent_t));
    for (uint j = 0; j < active_gpus.count(); j++) {

Again, you can loop over active_gpu_count and use streams as a variable below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2979 at r2 (raw file):

  CudaStreams true_streams;
  CudaStreams false_streams;
  CudaStreams active_gpus;

Here there are no events, so we could get rid of active_gpu_count in the struct altogether; no need to store active_gpus.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3381 at r2 (raw file):

  CudaStreams local_streams_1;
  CudaStreams local_streams_2;
  CudaStreams active_gpus;

You can remove this from the struct: there are no events, so no need to keep track of active_gpu_count in the struct anymore.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):

        2 * num_radix_blocks * sizeof(Torus), streams.stream(0),
        streams.gpu_index(0), allocate_gpu_memory);
    auto active_gpus_pred = streams.active_gpu_subset(2 * num_radix_blocks);

active_gpus -> active_streams
(everywhere)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4017 at r2 (raw file):

  CudaStreams lsb_streams;
  CudaStreams msb_streams;
  CudaStreams active_gpus;

Same here you can remove it from the struct


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4223 at r2 (raw file):

template <typename Torus> struct unsigned_int_div_rem_memory {
  int_radix_params params;
  CudaStreams active_gpus;

Again, there are no events here, so no need to keep this.

Contributor

@agnesLeroy agnesLeroy left a comment

Ok, overall it's starting to look really good, thanks a lot for the changes @andrei-stoian-zama

Reviewable status: 4 of 66 files reviewed, 28 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):

    PANIC("Cuda error: number of gpus in scatter should be <= number of gpus "
          "used to create the lut")
  dest.resize(streams.count());

So isn't it possible to remove this dest.resize()?


tfhe/src/high_level_api/integers/signed/tests/gpu.rs line 159 at r2 (raw file):

    let gt_tmp_buffer_size = a.get_gt_size_on_gpu(b);
    println!("1");

We need to remove all the println! calls from here.

Member

@IceTDrinker IceTDrinker left a comment

For the core part there isn't much to say, so I guess it's ok like this.

@IceTDrinker reviewed 3 of 66 files at r1, 1 of 3 files at r2.
Reviewable status: 7 of 66 files reviewed, 28 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)


Makefile line 1001 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

I don't think we need this new Makefile target, at least we don't have this on CPU so not sure we want to add it cc @IceTDrinker

Targets don't have to be useful in CI; if it's a commonly used command it can be ok, to be honest.

Contributor

@agnesLeroy agnesLeroy left a comment

Still some minor changes; otherwise it looks really good. We should run the 64-bit latency / throughput benchmarks just to check that performance is not affected.

@agnesLeroy reviewed 50 of 66 files at r1, 14 of 14 files at r3, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)

There are still some to be replaced, it seems.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

new_active_gpus -> active_streams?

Still some old naming: new_active_gpus should be new_active_streams.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

active_gpus -> active_streams
(everywhere)

Some more old naming, also for the one below. That's why I use sed 😆


backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

So isn't it possible to remove thise dest.resize()?

I have this change in another PR; probably better not to do it here.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 347 at r3 (raw file):

    ///////////////
    for (uint i = 0; i < active_streams.count(); i++) {
      cuda_set_device(streams.gpu_index(i));

Here it would probably be better to use active_streams instead of streams


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 355 at r3 (raw file):

      uint64_t size = 0;
      execute_scratch_pbs<Torus>(
          streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 368 at r3 (raw file):

    // We create the events only if we have multiple GPUs
    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 369 at r3 (raw file):

    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));
      event_broadcast = cuda_create_event(streams.gpu_index(0));

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 383 at r3 (raw file):

    for (uint i = 0; i < active_streams.count(); i++) {
      auto lut = (Torus *)cuda_malloc_with_size_tracking_async(
          num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 386 at r3 (raw file):

          size_tracker, allocate_gpu_memory);
      auto lut_indexes = (Torus *)cuda_malloc_with_size_tracking_async(
          lut_indexes_size, streams.stream(i), streams.gpu_index(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 391 at r3 (raw file):

      // if a different behavior is wanted, it should be rewritten later
      cuda_memset_with_size_tracking_async(
          lut_indexes, 0, lut_indexes_size, streams.stream(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 443 at r3 (raw file):

                                allocate_gpu_memory);
    cuda_synchronize_stream(streams.stream(0), streams.gpu_index(0));
    multi_gpu_copy_array_async(active_streams, lwe_trivial_indexes_vec,

Why is active_streams used here and not in the allocation functions? It would make more sense to use it for the multi-GPU allocs as well, no?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 560 at r3 (raw file):

    active_streams = streams.active_gpu_subset(num_radix_blocks);
    for (uint i = 0; i < active_streams.count(); i++) {
      cuda_set_device(streams.gpu_index(i));

streams -> active_streams?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 568 at r3 (raw file):

      uint64_t size = 0;
      execute_scratch_pbs<Torus>(
          streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,

Again


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 580 at r3 (raw file):

    // We create the events only if we have multiple GPUs
    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));

Again and the one below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 594 at r3 (raw file):

    for (uint i = 0; i < active_streams.count(); i++) {
      auto lut = (Torus *)cuda_malloc_with_size_tracking_async(
          num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),

Again and the two below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 775 at r3 (raw file):

  void release(CudaStreams streams) {
    GPU_ASSERT(lut_indexes_vec.size() == lut_vec.size(),

PANIC_IF_FALSE would be better I think


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 827 at r3 (raw file):

        multi_gpu_release_async(streams, lwe_after_pbs_vec);
        multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
        streams.synchronize();

I think we could remove this streams.synchronize; we don't need it as we synchronize on the active streams below. This is some legacy synchronization I had introduced because I wasn't sure whether calling clear without synchronizing first would be ok, but it is, as the GPU pointers have their own "life" on the GPU. Once the drop instruction is issued from the CPU it's fine to clear.
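
For illustration, a standalone sketch of that pattern with plain CUDA runtime calls (cudaFreeAsync needs CUDA 11.2+; the backend uses its own release helpers):

    #include <cstddef>
    #include <cuda_runtime.h>
    #include <vector>

    void release_async_then_sync(std::vector<void *> &device_ptrs,
                                 std::vector<cudaStream_t> &active_streams) {
      if (active_streams.empty())
        return;
      // Enqueue the frees asynchronously on the active streams.
      for (std::size_t i = 0; i < device_ptrs.size(); i++)
        cudaFreeAsync(device_ptrs[i], active_streams[i % active_streams.size()]);
      // Safe to clear the host-side vector right away: the frees are already
      // enqueued and the device allocations have their own lifetime on the GPU.
      device_ptrs.clear();
      // One synchronize over the active streams before they are reused.
      for (auto s : active_streams)
        cudaStreamSynchronize(s);
    }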


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1054 at r3 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    streams.synchronize();

Again, I don't think we need to synchronize here. Maybe we could keep these two synchronize calls just to avoid introducing a bad bug in this PR, and remove them later in another PR just to be safer.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r3 (raw file):

  // streams on the same GPU
  void create_on_same_gpus(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Still a GPU_ASSERT here

@andrei-stoian-zama
Contributor Author

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r3 (raw file):

  // streams on the same GPU
  void create_on_same_gpus(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Still a GPU_ASSERT here

In Reviewable, if I select r4, which I would guess is the latest version, it seems it's not there anymore.

Contributor

@agnesLeroy agnesLeroy left a comment

Ok, looks all good; let's address the comments about using active_streams and removing some synchronize calls in another PR 👍

@andrei-stoian-zama andrei-stoian-zama merged commit 1dcc3c8 into main Sep 18, 2025
117 of 119 checks passed
@andrei-stoian-zama andrei-stoian-zama deleted the as/struct_for_gpu_indices branch September 18, 2025 07:43