Conversation

@andrei-stoian-zama
Contributor

@andrei-stoian-zama andrei-stoian-zama commented Sep 3, 2025

Encapsulate stream handling in the FFI and the backend


This change is Reviewable

@cla-bot cla-bot bot added the cla-signed label Sep 3, 2025
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/struct_for_gpu_indices branch 5 times, most recently from a8d639d to bcb82e1 Compare September 11, 2025 12:41
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/struct_for_gpu_indices branch 2 times, most recently from 18a73c5 to 58f7e88 Compare September 11, 2025 19:21
@zama-bot zama-bot removed the approved label Sep 11, 2025
@andrei-stoian-zama andrei-stoian-zama marked this pull request as ready for review September 11, 2025 19:28
@IceTDrinker
Member

Is this ready for review or should it have been a draft PR ?

@IceTDrinker
Member

Asking since the PR says POC

@agnesLeroy
Contributor

@IceTDrinker not ready for review yet, CI is red.

@IceTDrinker
Member

I understand that :) just wondering if this needs a review down the line as it's labelled POC

@agnesLeroy
Contributor

It will 😉 we'll change the PR title once it's ready. I think it'll be mostly for me to review as it's a low-level change in the Cuda backend.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

                uint32_t num_radix_blocks, bool allocate_gpu_memory,
                uint64_t &size_tracker)
      : residence_streams(streams) {

I don't know if we want to attach this set of streams to the LUT; it is actually the one that comes from Rust. Is there really a need to have it here? If possible we should probably not have any.

@agnesLeroy
Contributor

Makefile line 1001 at r2 (raw file):

	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) nextest list --cargo-profile $(CARGO_PROFILE) \
		--features=integer,internal-keycache,gpu,zk-pok -p tfhe \
		-E "test(/.*gpu.*/)"

I don't think we need this new Makefile target; at least we don't have this on CPU, so I'm not sure we want to add it cc @IceTDrinker

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 113 at r2 (raw file):

  // set Can be used to parallelize computation by issuing kernels on multiple
  // streams on the same GPU
  void assign_clone(const CudaStreams &other) {

So I would not name this assign_clone; what this function does here is create a new set of streams on the same subset of GPUs. Probably create_from would be better?
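
For illustration, here's a minimal standalone sketch of the suggested semantics, using a hypothetical StreamSet type and plain CUDA runtime calls rather than the backend's CudaStreams helpers:

    #include <cuda_runtime.h>
    #include <vector>

    struct StreamSet {
      std::vector<cudaStream_t> streams;
      std::vector<int> gpu_indexes;

      // Create a fresh set of streams on the same GPU indices as `other`.
      void create_from(const StreamSet &other) {
        for (int gpu : other.gpu_indexes) {
          cudaSetDevice(gpu);
          cudaStream_t s;
          cudaStreamCreate(&s);
          streams.push_back(s);
          gpu_indexes.push_back(gpu);
        }
      }
    };

This is only meant to show the "new streams on the same GPUs" intent, not the actual CudaStreams implementation.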

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r2 (raw file):

  // streams on the same GPU
  void assign_clone(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 116 at r2 (raw file):

    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

    cudaStream_t *streams_clone = new cudaStream_t[other._gpu_count];

streams_clone -> new_streams?

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 837 at r2 (raw file):

      lwe_trivial_indexes_vec.clear();
      if (active_gpus.count() > 1) {
        active_gpus.synchronize();

It's ok to use streams here and restrict the number of GPUs to active_gpu_count

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 923 at r2 (raw file):

    ///////////////
    active_gpus = streams.active_gpu_subset(num_radix_blocks);

Do we need to keep track of active_gpu_count in the int_noise_squashing_lut struct? I don't think so; this one we could remove altogether.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

Actually right now there are two:

  • residence_streams which are the ones coming from Rust
  • active_gpus which is a subset depending on the number of blocks

This seems too complicated: how about getting rid of the two? When calling the constructor that does mem_reuse the set of streams should be the one coming from Rust. I think that's all we need to do. Having a set of streams encapsulated in the lut makes it more complicated to follow what's going on I think: given that we always recompute the active gpu count when broadcasting or calling execute_keyswitch / pbs this should work.

The only thing we need to keep in int_radix_lut is active_gpu_count because it's used in the release function.
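
For illustration, a minimal sketch of recomputing the active subset on demand; this assumes the active count is simply the minimum of the radix block count and the available GPUs (hypothetical helper, not necessarily the backend's actual active_gpu_subset logic):

    #include <algorithm>
    #include <cstdint>

    // Hypothetical helper: how many GPUs are worth using for a given number
    // of radix blocks (assumes one block is the smallest unit of work).
    uint32_t active_gpu_count_for(uint32_t num_radix_blocks, uint32_t total_gpus) {
      return std::min(num_radix_blocks, total_gpus);
    }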

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 335 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

The only thing we need to keep in int_radix_lut is active_gpu_count because it's used in the release function.

We could even get rid of this by looping over pbs_buffer.size() in the release function

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1017 at r2 (raw file):

  }
  void release(CudaStreams streams) {
    for (uint i = 0; i < active_gpus.count(); i++) {

How about looping over pbs_buffer.size() to avoid having to store active_gpu_count?

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1057 at r2 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    active_gpus.synchronize();

It's ok if we do streams.synchronize(); here, as it shouldn't impact performance at all.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1017 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

How about looping over pbs_buffer.size() to avoid having to store active_gpu_count?

We could define

auto active_gpu_count = pbs_buffer.size();
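
For illustration, a standalone sketch of the idea with a hypothetical struct (the real release also frees the per-GPU buffers and index vectors):

    #include <cstddef>
    #include <vector>

    struct lut_buffers {
      std::vector<void *> pbs_buffer; // one entry per active GPU

      void release() {
        // No stored active_gpu_count: the vector size already encodes it.
        auto active_gpu_count = pbs_buffer.size();
        for (std::size_t i = 0; i < active_gpu_count; i++) {
          // free pbs_buffer[i] on its GPU here (backend-specific call)
        }
        pbs_buffer.clear();
      }
    };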

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1057 at r2 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    active_gpus.synchronize();

And here we could keep the loop over active_gpu_count

Contributor

@agnesLeroy agnesLeroy left a comment

@agnesLeroy reviewed 2 of 66 files at r1, 2 of 3 files at r2, all commit messages.
Reviewable status: 4 of 66 files reviewed, 26 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):

        lut_indexes, new_lut_indexes, new_num_blocks * sizeof(Torus),
        streams.stream(0), streams.gpu_index(0), gpu_memory_allocated);
    auto new_active_gpus = streams.active_gpu_subset(new_num_blocks);

new_active_gpus -> active_streams?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2513 at r2 (raw file):

          gpu_memory_allocated);

      auto active_gpus = streams.active_gpu_subset(1);

For this variable name I'd suggest
active_gpus -> active_streams
(applies everywhere)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2831 at r2 (raw file):

  int_radix_params params;

  CudaStreams active_gpus;

active_gpu_count was used to track how many sub-streams and events are created; I think we should keep that logic and only store the number of active GPUs for the sub-streams & events here.
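
For illustration, a standalone sketch of keeping only the count for the event bookkeeping, using a hypothetical struct and plain CUDA runtime calls:

    #include <cstdint>
    #include <cuda_runtime.h>
    #include <vector>

    struct sub_stream_events {
      uint32_t active_gpu_count = 0;
      std::vector<cudaEvent_t> incoming_events;

      void create(uint32_t count) {
        // Remember how many events were created so release() matches exactly.
        active_gpu_count = count;
        incoming_events.resize(count);
        for (uint32_t i = 0; i < count; i++)
          cudaEventCreate(&incoming_events[i]);
      }

      void release() {
        for (uint32_t i = 0; i < active_gpu_count; i++)
          cudaEventDestroy(incoming_events[i]);
        incoming_events.clear();
        active_gpu_count = 0;
      }
    };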


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2922 at r2 (raw file):

    outgoing_events2 =
        (cudaEvent_t *)malloc(active_gpus.count() * sizeof(cudaEvent_t));
    for (uint j = 0; j < active_gpus.count(); j++) {

Again, you can loop over active_gpu_count and use streams as a variable below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2979 at r2 (raw file):

  CudaStreams true_streams;
  CudaStreams false_streams;
  CudaStreams active_gpus;

Here there are no events, so we could get rid of active_gpu_count in the struct altogether; no need to store active_gpus.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3381 at r2 (raw file):

  CudaStreams local_streams_1;
  CudaStreams local_streams_2;
  CudaStreams active_gpus;

You can remove this from the struct: there are no events, so no need to keep track of active_gpu_count in the struct anymore.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):

        2 * num_radix_blocks * sizeof(Torus), streams.stream(0),
        streams.gpu_index(0), allocate_gpu_memory);
    auto active_gpus_pred = streams.active_gpu_subset(2 * num_radix_blocks);

active_gpus -> active_streams
(everywhere)


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4017 at r2 (raw file):

  CudaStreams lsb_streams;
  CudaStreams msb_streams;
  CudaStreams active_gpus;

Same here you can remove it from the struct


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 4223 at r2 (raw file):

template <typename Torus> struct unsigned_int_div_rem_memory {
  int_radix_params params;
  CudaStreams active_gpus;

Again, there are no events here, so no need to keep this.

Contributor

@agnesLeroy agnesLeroy left a comment

Ok, overall it's starting to look really good, thanks a lot for the changes @andrei-stoian-zama

Reviewable status: 4 of 66 files reviewed, 28 unresolved discussions (waiting on @andrei-stoian-zama, @IceTDrinker, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):

    PANIC("Cuda error: number of gpus in scatter should be <= number of gpus "
          "used to create the lut")
  dest.resize(streams.count());

So isn't it possible to remove this dest.resize()?


tfhe/src/high_level_api/integers/signed/tests/gpu.rs line 159 at r2 (raw file):

    let gt_tmp_buffer_size = a.get_gt_size_on_gpu(b);
    println!("1");

We need to remove all the println! calls from here.

Member

@IceTDrinker IceTDrinker left a comment

For the core part there isn't much to say, so I guess it's ok like this.

@IceTDrinker reviewed 3 of 66 files at r1, 1 of 3 files at r2.
Reviewable status: 7 of 66 files reviewed, 28 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)


Makefile line 1001 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

I don't think we need this new Makefile target, at least we don't have this on CPU so not sure we want to add it cc @IceTDrinker

Targets don't have to be useful in CI; if it's a commonly used command it can be ok, to be honest.

Contributor

@agnesLeroy agnesLeroy left a comment

Still some minor changes; otherwise it looks really good. We should run the 64-bit latency / throughput benchmarks just to check that performance is not affected.

@agnesLeroy reviewed 50 of 66 files at r1, 14 of 14 files at r3, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @agnesLeroy, @andrei-stoian-zama, @soonum, and @tmontaigu)


backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

Instead of GPU_ASSERT we should use PANIC_IF_FALSE (applies everywhere in this file)

There are still some to be replaced, it seems.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 2346 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

new_active_gpus -> active_streams?

Still some old naming: new_active_gpus should be new_active_streams.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 3628 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

active_gpus -> active_streams
(everywhere)

Some more old naming, also for the one below. That's why I use sed 😆


backends/tfhe-cuda-backend/cuda/src/utils/helper_multi_gpu.cuh line 152 at r2 (raw file):

Previously, agnesLeroy (Agnès Leroy) wrote…

So isn't it possible to remove thise dest.resize()?

I have this change in another PR; probably better not to do it here.


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 347 at r3 (raw file):

    ///////////////
    for (uint i = 0; i < active_streams.count(); i++) {
      cuda_set_device(streams.gpu_index(i));

Here it would probably be better to use active_streams instead of streams


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 355 at r3 (raw file):

      uint64_t size = 0;
      execute_scratch_pbs<Torus>(
          streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 368 at r3 (raw file):

    // We create the events only if we have multiple GPUs
    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 369 at r3 (raw file):

    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));
      event_broadcast = cuda_create_event(streams.gpu_index(0));

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 383 at r3 (raw file):

    for (uint i = 0; i < active_streams.count(); i++) {
      auto lut = (Torus *)cuda_malloc_with_size_tracking_async(
          num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 386 at r3 (raw file):

          size_tracker, allocate_gpu_memory);
      auto lut_indexes = (Torus *)cuda_malloc_with_size_tracking_async(
          lut_indexes_size, streams.stream(i), streams.gpu_index(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 391 at r3 (raw file):

      // if a different behavior is wanted, it should be rewritten later
      cuda_memset_with_size_tracking_async(
          lut_indexes, 0, lut_indexes_size, streams.stream(i),

Same here


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 443 at r3 (raw file):

                                allocate_gpu_memory);
    cuda_synchronize_stream(streams.stream(0), streams.gpu_index(0));
    multi_gpu_copy_array_async(active_streams, lwe_trivial_indexes_vec,

Why is active_streams used here and not in the allocation functions? It would make more sense to use it for the multi-GPU allocs as well, no?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 560 at r3 (raw file):

    active_streams = streams.active_gpu_subset(num_radix_blocks);
    for (uint i = 0; i < active_streams.count(); i++) {
      cuda_set_device(streams.gpu_index(i));

streams -> active_streams?


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 568 at r3 (raw file):

      uint64_t size = 0;
      execute_scratch_pbs<Torus>(
          streams.stream(i), streams.gpu_index(i), &gpu_pbs_buffer,

Again


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 580 at r3 (raw file):

    // We create the events only if we have multiple GPUs
    if (active_streams.count() > 1) {
      event_scatter_in = cuda_create_event(streams.gpu_index(0));

Again and the one below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 594 at r3 (raw file):

    for (uint i = 0; i < active_streams.count(); i++) {
      auto lut = (Torus *)cuda_malloc_with_size_tracking_async(
          num_luts * lut_buffer_size, streams.stream(i), streams.gpu_index(i),

Again and the two below


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 775 at r3 (raw file):

  void release(CudaStreams streams) {
    GPU_ASSERT(lut_indexes_vec.size() == lut_vec.size(),

PANIC_IF_FALSE would be better I think


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 827 at r3 (raw file):

        multi_gpu_release_async(streams, lwe_after_pbs_vec);
        multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
        streams.synchronize();

I think we could remove this streams.synchronize; we don't need it as we synchronize on the active streams below. This is some legacy synchronization I had introduced because I wasn't sure whether calling clear without synchronizing first would be ok, but it is, as the GPU pointers have their own "life" on the GPU. Once the drop instruction is issued from the CPU it's fine to clear.
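
For illustration, a standalone sketch of that pattern with plain CUDA runtime calls (cudaFreeAsync needs CUDA 11.2+; the backend uses its own release helpers):

    #include <cstddef>
    #include <cuda_runtime.h>
    #include <vector>

    void release_async_then_sync(std::vector<void *> &device_ptrs,
                                 std::vector<cudaStream_t> &active_streams) {
      if (active_streams.empty())
        return;
      // Enqueue the frees asynchronously on the active streams.
      for (std::size_t i = 0; i < device_ptrs.size(); i++)
        cudaFreeAsync(device_ptrs[i], active_streams[i % active_streams.size()]);
      // Safe to clear the host-side vector right away: the frees are already
      // enqueued and the device allocations have their own lifetime on the GPU.
      device_ptrs.clear();
      // One synchronize over the active streams before they are reused.
      for (auto s : active_streams)
        cudaStreamSynchronize(s);
    }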


backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h line 1054 at r3 (raw file):

    multi_gpu_release_async(streams, lwe_after_pbs_vec);
    multi_gpu_release_async(streams, lwe_trivial_indexes_vec);
    streams.synchronize();

Again, I don't think we need to synchronize here. Maybe we could keep these two synchronize calls just to avoid introducing a bad bug in this PR, and remove them later in another PR just to be safer.

@agnesLeroy
Contributor

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r3 (raw file):

  // streams on the same GPU
  void create_on_same_gpus(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Still a GPU_ASSERT here

@andrei-stoian-zama
Contributor Author

backends/tfhe-cuda-backend/cuda/include/helper_multi_gpu.h line 114 at r3 (raw file):

  // streams on the same GPU
  void create_on_same_gpus(const CudaStreams &other) {
    GPU_ASSERT(_streams == nullptr, "Assign clone to non-empty cudastreams");

Still a GPU_ASSERT here

In Reviewable, if I select r4, which I would guess is the latest version, it seems it's not there anymore.

Contributor

@agnesLeroy agnesLeroy left a comment

Ok, looks all good; let's address the comments about using active_streams and removing some synchronize calls in another PR 👍

@andrei-stoian-zama andrei-stoian-zama merged commit 1dcc3c8 into main Sep 18, 2025
117 of 119 checks passed
@andrei-stoian-zama andrei-stoian-zama deleted the as/struct_for_gpu_indices branch September 18, 2025 07:43