
Conversation

hipudding (Collaborator) commented Aug 28, 2025

RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors.

This commit records the first layer of each device to avoid the above issues.

Update
To avoid coupling the RoPE cache to a specific model, we now cache only the entries that the input parameters guarantee will not undergo any transformation.
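
As an illustration of that rule, here is a minimal sketch (hypothetical names, not the code in this PR) of a cache that is reused only when every parameter it was derived from matches the incoming RoPE request exactly and no freq_factors tensor is supplied:

```cpp
#include <cstdint>

// Hypothetical sketch (not the code in this PR): the cache remembers the
// parameters it was built from and is reused only on an exact match.
struct rope_cache_key {
    int64_t n_dims;           // number of rotated dimensions (ne00)
    float   freq_base;        // base used to derive theta_scale
    float   freq_scale;       // linear position scaling
    bool    has_freq_factors; // src2 present -> per-dim factors, do not cache

    bool operator==(const rope_cache_key & other) const {
        return n_dims == other.n_dims &&
               freq_base == other.freq_base &&
               freq_scale == other.freq_scale &&
               !has_freq_factors && !other.has_freq_factors;
    }
};

struct rope_cache {
    rope_cache_key key{};
    bool           valid = false;

    // The cached sin/cos table may be reused only for an identical request.
    bool usable_for(const rope_cache_key & req) const {
        return valid && key == req;
    }
};
```

Anything that cannot be decided from these parameters alone (for example per-layer freq_factors) falls back to recomputation.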

./bin/test-backend-ops test -b CANN0 -o ROPE
Testing 3 devices
Backend 1/3: CANN0
Device description: Ascend910B4
Device memory: 30196 MB (29802 MB free)
11837/11837 tests passed
Backend CANN0: OK
Backend 2/3: CANN1
Skipping
Backend 3/3: CPU
Skipping
3/3 backends passed
OK

@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Ascend NPU issues specific to Ascend NPUs labels Aug 28, 2025
@hipudding hipudding requested review from ggerganov and slaren August 28, 2025 10:55
Comment on lines 2274 to 2290
// get first layer in current device.
int layer = 0;
const char* dash = std::strchr(dst->name, '-');
if (dash) {
    layer = std::strtol(dash + 1, nullptr, 10);
}

// remember the first layer.
if (ctx.rope_cache.first_layer == -1) {
    ctx.rope_cache.first_layer = layer;
}

int64_t theta_scale_length = ne00 / 2;
// only init cache when freq_factors is not null or first layer.
// dash == nullptr means we are in test-backend-ops
if (dash != nullptr && src2 == nullptr && layer != ctx.rope_cache.first_layer) {
    // use cache.
    return;
}

Member commented

This is really hacky. Can you improve without making assumptions about the tensor names? Maybe create the cache based on the input parameters?

hipudding (Collaborator, Author) commented Aug 28, 2025

I’ve tried, but during the decode stage it isn’t possible to decide this from the shape of position: all position lengths are the same, and the position tensor itself, as well as its position->data pointer, are identical. The only difference is the data inside position, and copying that data from the device to the host is not a good approach. Do you have any good suggestions for this?

hipudding (Collaborator, Author) commented

I think I’ve come up with a method: during the forward computation of the ggml_cgraph, add a marker when encountering the first RoPE operator and perform the cache calculation there. Subsequent RoPE operators would then skip the computation. This way we can avoid parsing the tensor’s name. I will try this approach.
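
A rough sketch of that idea (the flag, struct, and function names here are hypothetical), assuming the backend already iterates the graph node by node in its graph_compute implementation:

```cpp
#include "ggml.h"

// Hypothetical per-backend context: the flag is reset for every graph, so the
// first GGML_OP_ROPE node rebuilds the sin/cos cache and all later rope nodes
// in the same graph reuse it.
struct backend_ctx {
    bool rope_cache_ready = false;
};

static void backend_graph_compute(backend_ctx & ctx, struct ggml_cgraph * graph) {
    ctx.rope_cache_ready = false; // new graph -> cache state unknown
    for (int i = 0; i < ggml_graph_n_nodes(graph); ++i) {
        struct ggml_tensor * node = ggml_graph_node(graph, i);
        if (node->op == GGML_OP_ROPE) {
            if (!ctx.rope_cache_ready) {
                // ... compute the sin/cos cache from this node's parameters ...
                ctx.rope_cache_ready = true;
            }
            // ... launch the rope kernel using the cache ...
        } else {
            // ... dispatch the other operators as usual ...
        }
    }
}
```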

noemotiovon (Collaborator) commented

The current scenario is that when computing the sine/cosine for RoPE, it is calculated only for the first layer on each device, and the other layers reuse it. It is therefore necessary to identify which layer is the first layer on the current device. However, there is currently no way to obtain the layer number within the device backend, so it can only be inferred from the tensor name. Would it be possible to store the layer number information directly in the tensor?

@hipudding hipudding self-assigned this Aug 28, 2025
@hipudding hipudding requested a review from ggerganov August 29, 2025 00:59

noemotiovon (Collaborator) left a comment

LGTM!

noemotiovon (Collaborator) commented

This is an excellent refactor that removes the previous hacky implementation! We’ll do another refactor later for the case where src2 != nullptr.

ggerganov (Member) left a comment

AFAIU the cache will be computed for every graph_compute() call on the first rope operation and then will be reused for all remaining rope operations in the current graph. I think this assumes that all ropes are the same in all layers. In general this is not guaranteed and will likely cause problems in the future.

hipudding (Collaborator, Author) commented

AFAIU the cache will be computed for every graph_compute() call on the first rope operation and then will be reused for all remaining rope operations in the current graph. I think this assumes that all ropes are the same in all layers. In general this is not guaranteed and will likely cause problems in the future.

As far as I know, apart from freq_factors, the RoPE cache used for each token only depends on the position and some hyperparameters. So far, I haven’t observed any cases where the cache differs across layers. Could you clarify in what situations the RoPE cache would be layer-dependent? Thanks!
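
For reference, a simplified CPU-style sketch of what the cache holds (a hypothetical helper that ignores the ext_factor/YaRN corrections and attn_factor): every value is a function of the position and the rope hyperparameters only, so nothing is layer-specific unless freq_base or freq_factors differ per layer.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Simplified sketch of a NeoX-style sin/cos table (hypothetical helper).
static void rope_sin_cos(int32_t pos, int n_dims, float freq_base, float freq_scale,
                         const float * freq_factors, // may be null
                         std::vector<float> & sin_out, std::vector<float> & cos_out) {
    const float theta_scale = powf(freq_base, -2.0f / n_dims);
    sin_out.resize(n_dims / 2);
    cos_out.resize(n_dims / 2);
    float theta = (float) pos * freq_scale;
    for (int i = 0; i < n_dims / 2; ++i) {
        const float factor = freq_factors ? freq_factors[i] : 1.0f;
        sin_out[i] = sinf(theta / factor);
        cos_out[i] = cosf(theta / factor);
        theta *= theta_scale; // theta_i = pos * freq_scale * theta_scale^i
    }
}
```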

ggerganov (Member) commented

For example, Gemma3n uses a different freq_base for different layers:

llama.cpp/src/llama-model.cpp, lines 10635 to 10639 in f15d515:

for (int il = 0; il < n_layer; ++il) {
    // this block is made to be closely resemble Gemma3p5DecoderLayer on python code
    const float freq_base_l  = model.get_rope_freq_base (cparams, il);
    const float freq_scale_l = model.get_rope_freq_scale(cparams, il);

Also, this simple test program would likely not run correctly if it was offloaded to the CANN backend (currently it always runs on the CPU):

if (m < 3) {
    struct ggml_tensor * p0 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);
    struct ggml_tensor * p1 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);
    struct ggml_tensor * p2 = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, ne[2]);

    for (int i = 0; i < ne[2]; ++i) {
        ((int32_t *) p0->data)[i] = n_past_0 + i;
        ((int32_t *) p1->data)[i] = n_past_2 - n_past_0;
        ((int32_t *) p2->data)[i] = n_past_2 + i;
    }

    // test mode 0, 2, 4 (standard, GPT-NeoX, GLM)
    mode = m == 0 ? 0 : m == 1 ? 2 : 4;

    // 100, 101, 102, ..., 172
    r0 = ggml_rope(ctx0, x, p0, n_rot, mode);
    // -67, -67, -67, ..., -67
    r1 = ggml_rope(ctx0, r0, p1, n_rot, mode); // "context swap", i.e. forget n_past_0 - n_past_2 tokens
    //  33,  34,  35, ..., 105
    r2 = ggml_rope(ctx0, x, p2, n_rot, mode);
} else {
    // testing multi-dimension rope position embedding mode

Generally, we want to keep the code generic and not make assumptions about the application. llama.cpp is just one of the applications that use ggml and other applications are possible to call ggml_rope with different parameters within the same graph.

hipudding (Collaborator, Author) commented

@ggerganov Thank you for the reminder. All unsafe caches have been removed; only the parts that the input parameters guarantee remain unchanged are now cached.
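
For instance (a sketch of the kind of data that can be kept under this rule, not the exact code merged here), the power table theta_scale^i is fully determined by freq_base and the number of rotated dimensions, so it can be cached and invalidated purely from those parameters, while anything that depends on the positions is recomputed each time:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch (hypothetical type): cache only the theta_scale^i table, which is
// fully determined by (freq_base, n_dims); rebuild it when either changes.
struct theta_power_cache {
    float              freq_base = 0.0f;
    int64_t            n_dims    = 0;
    std::vector<float> powers;   // powers[i] = theta_scale^i

    const std::vector<float> & get(float fb, int64_t nd) {
        if (fb != freq_base || nd != n_dims || powers.empty()) {
            freq_base = fb;
            n_dims    = nd;
            const float theta_scale = powf(fb, -2.0f / (float) nd);
            powers.resize((size_t) (nd / 2));
            float p = 1.0f;
            for (int64_t i = 0; i < nd / 2; ++i) {
                powers[(size_t) i] = p;
                p *= theta_scale;
            }
        }
        return powers;
    }
};
```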

noemotiovon (Collaborator) commented

LGTM. However, for transformer models, removing the sin/cos cache in RoPE leads to a performance drop compared to before. We’ll need to explore a more elegant way to determine positional information in the future so that the cache check remains precise.

@hipudding hipudding merged commit 3dc7397 into ggml-org:master Sep 1, 2025
49 checks passed
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* CANN: fix RoPE cache issue on multi-device

RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.

This commit records the first layer of each device to avoid
the above issues.

* CANN: Optimize first-layer detection method

* CANN: Remove trailing whitespace

* CANN: Only cache the data that can be determined as unchanged through the parameters.

* CANN: Update function comment