
Conversation

hipudding (Collaborator) commented Aug 28, 2025

RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors.

This commit records the first layer of each device to avoid the above issues.

Update
To avoid coupling the RoPE cache to a specific model, we now cache only the entries that the input parameters guarantee will not undergo any transformation.
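
As an illustration of that rule, here is a minimal sketch (hypothetical names, not the code in this PR) of a cache that is reused only when every parameter it was derived from matches the incoming RoPE request exactly and no freq_factors tensor is supplied:

```cpp
#include <cstdint>

// Hypothetical sketch (not the code in this PR): the cache remembers the
// parameters it was built from and is reused only on an exact match.
struct rope_cache_key {
    int64_t n_dims;           // number of rotated dimensions (ne00)
    float   freq_base;        // base used to derive theta_scale
    float   freq_scale;       // linear position scaling
    bool    has_freq_factors; // src2 present -> per-dim factors, do not cache

    bool operator==(const rope_cache_key & other) const {
        return n_dims == other.n_dims &&
               freq_base == other.freq_base &&
               freq_scale == other.freq_scale &&
               !has_freq_factors && !other.has_freq_factors;
    }
};

struct rope_cache {
    rope_cache_key key{};
    bool           valid = false;

    // The cached sin/cos table may be reused only for an identical request.
    bool usable_for(const rope_cache_key & req) const {
        return valid && key == req;
    }
};
```

Anything that cannot be decided from these parameters alone (for example per-layer freq_factors) falls back to recomputation.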

./bin/test-backend-ops test -b CANN0 -o ROPE
Testing 3 devices
Backend 1/3: CANN0
Device description: Ascend910B4
Device memory: 30196 MB (29802 MB free)
11837/11837 tests passed
Backend CANN0: OK
Backend 2/3: CANN1
Skipping
Backend 3/3: CPU
Skipping
3/3 backends passed
OK

@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Ascend NPU issues specific to Ascend NPUs labels Aug 28, 2025
@hipudding hipudding requested review from ggerganov and slaren August 28, 2025 10:55
Comment on lines 2274 to 2290
// get first layer in current device.
int layer = 0;
const char* dash = std::strchr(dst->name, '-');
if (dash) {
    layer = std::strtol(dash + 1, nullptr, 10);
}

// remember the first layer.
if (ctx.rope_cache.first_layer == -1) {
    ctx.rope_cache.first_layer = layer;
}

int64_t theta_scale_length = ne00 / 2;
// only init cache when freq_factors is not null or first layer.
// dash == nullptr means we are in test-backend-ops
if (dash != nullptr && src2 == nullptr && layer != ctx.rope_cache.first_layer) {
    // use cache.
    return;
}

Member commented

This is really hacky. Can you improve without making assumptions about the tensor names? Maybe create the cache based on the input parameters?

hipudding (Collaborator, Author) commented Aug 28, 2025

I’ve tried, but during the decode stage it isn’t possible to decide this from the shape of position: all position lengths are the same, and the position tensor itself, as well as its position->data pointer, are identical. The only difference is the data inside position, and copying that data from the device to the host is not a good approach. Do you have any good suggestions for this?

hipudding (Collaborator, Author) commented

I think I’ve come up with a method: during the forward computation of the ggml_cgraph, add a marker when encountering the first RoPE operator and perform the cache calculation there. Subsequent RoPE operators would then skip the computation. This way we can avoid parsing the tensor’s name. I will try this approach.
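
A rough sketch of that idea (the flag, struct, and function names here are hypothetical), assuming the backend already iterates the graph node by node in its graph_compute implementation:

```cpp
#include "ggml.h"

// Hypothetical per-backend context: the flag is reset for every graph, so the
// first GGML_OP_ROPE node rebuilds the sin/cos cache and all later rope nodes
// in the same graph reuse it.
struct backend_ctx {
    bool rope_cache_ready = false;
};

static void backend_graph_compute(backend_ctx & ctx, struct ggml_cgraph * graph) {
    ctx.rope_cache_ready = false; // new graph -> cache state unknown
    for (int i = 0; i < ggml_graph_n_nodes(graph); ++i) {
        struct ggml_tensor * node = ggml_graph_node(graph, i);
        if (node->op == GGML_OP_ROPE) {
            if (!ctx.rope_cache_ready) {
                // ... compute the sin/cos cache from this node's parameters ...
                ctx.rope_cache_ready = true;
            }
            // ... launch the rope kernel using the cache ...
        } else {
            // ... dispatch the other operators as usual ...
        }
    }
}
```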

noemotiovon (Collaborator) commented

The current scenario is that when computing the sine/cosine for RoPE, it is calculated only for the first layer on each device, and the other layers reuse it. It is therefore necessary to identify which layer is the first layer on the current device. However, there is currently no way to obtain the layer number within the device backend, so it can only be inferred from the tensor name. Would it be possible to store the layer number information directly in the tensor?

@hipudding hipudding self-assigned this Aug 28, 2025
@hipudding hipudding requested a review from ggerganov August 29, 2025 00:59

noemotiovon (Collaborator) left a comment

LGTM!

noemotiovon (Collaborator) commented

This is an excellent refactor that removes the previous hacky implementation! We’ll do another refactor later for the case where src2 != nullptr.

ggerganov (Member) left a comment

AFAIU the cache will be computed for every graph_compute() call on the first rope operation and then will be reused for all remaining rope operations in the current graph. I think this assumes that all ropes are the same in all layers. In general this is not guaranteed and will likely cause problems in the future.

hipudding (Collaborator, Author) commented

AFAIU the cache will be computed for every graph_compute() call on the first rope operation and then will be reused for all remaining rope operations in the current graph. I think this assumes that all ropes are the same in all layers. In general this is not guaranteed and will likely cause problems in the future.

As far as I know, apart from freq_factors, the RoPE cache used for each token only depends on the position and some hyperparameters. So far, I haven’t observed any cases where the cache differs across layers. Could you clarify in what situations the RoPE cache would be layer-dependent? Thanks!
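
For reference, a simplified CPU-style sketch of what the cache holds (a hypothetical helper that ignores the ext_factor/YaRN corrections and attn_factor): every value is a function of the position and the rope hyperparameters only, so nothing is layer-specific unless freq_base or freq_factors differ per layer.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Simplified sketch of a NeoX-style sin/cos table (hypothetical helper).
static void rope_sin_cos(int32_t pos, int n_dims, float freq_base, float freq_scale,
                         const float * freq_factors, // may be null
                         std::vector<float> & sin_out, std::vector<float> & cos_out) {
    const float theta_scale = powf(freq_base, -2.0f / n_dims);
    sin_out.resize(n_dims / 2);
    cos_out.resize(n_dims / 2);
    float theta = (float) pos * freq_scale;
    for (int i = 0; i < n_dims / 2; ++i) {
        const float factor = freq_factors ? freq_factors[i] : 1.0f;
        sin_out[i] = sinf(theta / factor);
        cos_out[i] = cosf(theta / factor);
        theta *= theta_scale; // theta_i = pos * freq_scale * theta_scale^i
    }
}
```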

ggerganov (Member) commented

For example, Gemma3n uses a different freq_base for different layers:

llama.cpp/src/llama-model.cpp, lines 10635 to 10639 in f15d515:

for (int il = 0; il < n_layer; ++il) {
    // this block is made to be closely resemble Gemma3p5DecoderLayer on python code
    const float freq_base_l  = model.get_rope_freq_base (cparams, il);
    const float freq_scale_l = model.get_rope_freq_scale(cparams, il);

Also, this simple test program would likely not run correctly if it was offloaded to the CANN backend (currently it always runs on the CPU):

if (m < 3) {
    struct ggml_tensor * p0 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);
    struct ggml_tensor * p1 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);
    struct ggml_tensor * p2 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);

    for (int i = 0; i < ne[2]; ++i) {
        ((int32_t *) p0->data)[i] = n_past_0 + i;
        ((int32_t *) p1->data)[i] = n_past_2 - n_past_0;
        ((int32_t *) p2->data)[i] = n_past_2 + i;
    }

    // test mode 0, 2, 4 (standard, GPT-NeoX, GLM)
    mode = m == 0 ? 0 : m == 1 ? 2 : 4;

    // 100, 101, 102, ..., 172
    r0 = ggml_rope(ctx0, x, p0, n_rot, mode);
    // -67, -67, -67, ..., -67
    r1 = ggml_rope(ctx0, r0, p1, n_rot, mode); // "context swap", i.e. forget n_past_0 - n_past_2 tokens
    //  33,  34,  35, ..., 105
    r2 = ggml_rope(ctx0, x, p2, n_rot, mode);
} else {
    // testing multi-dimension rope position embedding mode

Generally, we want to keep the code generic and not make assumptions about the application. llama.cpp is just one of the applications that use ggml and other applications are possible to call ggml_rope with different parameters within the same graph.

hipudding (Collaborator, Author) commented

@ggerganov Thank you for the reminder. All unsafe caches have been removed; only the parts that the input parameters guarantee remain unchanged are now cached.
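
For instance (a sketch of the kind of data that can be kept under this rule, not the exact code merged here), the power table theta_scale^i is fully determined by freq_base and the number of rotated dimensions, so it can be cached and invalidated purely from those parameters, while anything that depends on the positions is recomputed each time:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch (hypothetical type): cache only the theta_scale^i table, which is
// fully determined by (freq_base, n_dims); rebuild it when either changes.
struct theta_power_cache {
    float              freq_base = 0.0f;
    int64_t            n_dims    = 0;
    std::vector<float> powers;   // powers[i] = theta_scale^i

    const std::vector<float> & get(float fb, int64_t nd) {
        if (fb != freq_base || nd != n_dims || powers.empty()) {
            freq_base = fb;
            n_dims    = nd;
            const float theta_scale = powf(fb, -2.0f / (float) nd);
            powers.resize((size_t) (nd / 2));
            float p = 1.0f;
            for (int64_t i = 0; i < nd / 2; ++i) {
                powers[(size_t) i] = p;
                p *= theta_scale;
            }
        }
        return powers;
    }
};
```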

noemotiovon (Collaborator) commented

LGTM. However, for transformer models, removing the sin/cos cache in RoPE leads to a performance drop compared to before. We’ll need to explore a more elegant way to determine positional information in the future so that the cache check remains precise.

@hipudding hipudding merged commit 3dc7397 into ggml-org:master Sep 1, 2025
49 checks passed
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* CANN: fix RoPE cache issue on multi-device

RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.

This commit records the first layer of each device to avoid
the above issues.

* CANN: Optimize first-layer detection method

* CANN: Remove trailing whitespace

* CANN: Only cache the data that can be determined as unchanged through the parameters.

* CANN: Update function comment