Eval bug: Asynchronous Kernel Execution on iGPU Causes Runtime Errors with MOE Model

### Name and Version

version: 5852 (6e2f8df3)
built with MSVC 19.44.35211.0 for Windows AMD64

### Operating systems

Windows

### GGML backends

SYCL

### Hardware

**Device**: Intel(R) Arc(TM) Graphics
**Processor**: Intel(R) Core(TM) Ultra 5 125H, 4500 MHz, 14 cores, 18 threads

### Models

OLMoE-1B-7B-0924.Q4_0.gguf
Qwen3-30B-A3B-128K-Q3_K_S.gguf

### Problem description & steps to reproduce

**Problem Description:**
Running an MoE model on Intel(R) Arc(TM) Graphics (Intel Core Ultra 5 125H) with the following command:
`build-x64-windows-sycl-release-f16\bin\llama-simple.exe -m "D:\Download\OLMoE-1B-7B-0924.Q4_0.gguf" -ngl 99`
triggers this runtime error:
`GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as) failed`
The error occurs during kernel execution and crashes the program.

**Steps to Reproduce:**
Run `llama-simple` with one of the MoE models listed above:
`build-x64-windows-sycl-release-f16\bin\llama-simple.exe -m "D:\Download\OLMoE-1B-7B-0924.Q4_0.gguf" -ngl 99`
`build-x64-windows-sycl-release-f16\bin\llama-simple.exe -m "D:\Download\Qwen3-30B-A3B-128K-Q3_K_S.gguf" -ngl 99`
The program outputs the error:
```
llama_context:      SYCL0 compute buffer size =    12.78 MiB
llama_context:  SYCL_Host compute buffer size =     1.52 MiB
llama_context: graph nodes  = 998
llama_context: graph splits = 2
Hello my name isD:\download\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3381: GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as) failed
```
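
To make the failing condition concrete, here is a minimal host-side sketch of the range check the assertion expresses. It assumes `n_as` is the number of experts and that the ids are plain 32-bit integers. `first_invalid_expert_id` is a hypothetical debugging helper, not a ggml function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical debugging helper (not part of ggml).
// Returns the position of the first id outside [0, n_as),
// or -1 if every id is a valid expert index.
inline int first_invalid_expert_id(const std::vector<int32_t> & ids, int n_as) {
    assert(n_as > 0);  // assumption: at least one expert exists
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] < 0 || ids[i] >= n_as) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Logging the first offending position and value before the assert fires could help tell stale or uninitialized ids apart from a wrong `n_as`.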
**Proposed Solution:**
Adding `stream->wait()` after the kernel launch in `argsort_f32_i32_sycl` resolves the issue. I am unsure why others have not encountered it: I have not seen similar reports from the community, so it may be a rare or hardware-specific edge case. I suspect the cause is asynchronous kernel execution on the iGPU, which may not be synchronized correctly in some scenarios.
```cpp
static void argsort_f32_i32_sycl(const float *x, int *dst, const int ncols,
                                 const int nrows, ggml_sort_order order,
                                 queue_ptr stream) {
    // bitonic sort requires ncols to be power of 2
    const int ncols_pad = next_power_of_2(ncols);

    const sycl::range<3> block_dims(1, 1, ncols_pad);
    const sycl::range<3> block_nums(1, nrows, 1);
    const size_t shared_mem = ncols_pad * sizeof(int);

    if (order == GGML_SORT_ORDER_ASC) {
        sycl_launch(stream, [&](sycl::handler & cgh) {
            sycl::local_accessor<uint8_t, 1> dpct_local_acc_ct1(
                sycl::range<1>(shared_mem), cgh);

            sycl_parallel_for(
                cgh, sycl::nd_range<3>(block_nums * block_dims, block_dims), [=](sycl::nd_item<3> item_ct1) {
                    k_argsort_f32_i32<GGML_SORT_ORDER_ASC>(
                        x, dst, ncols, ncols_pad, item_ct1,
                        dpct_local_acc_ct1.get_multi_ptr<sycl::access::decorated::no>()
                            .get());
                });
        });
    } else if (order == GGML_SORT_ORDER_DESC) {
        sycl_launch(stream, [&](sycl::handler & cgh) {
            sycl::local_accessor<uint8_t, 1> dpct_local_acc_ct1(
                sycl::range<1>(shared_mem), cgh);

            sycl_parallel_for(
                cgh, sycl::nd_range<3>(block_nums * block_dims, block_dims), [=](sycl::nd_item<3> item_ct1) {
                    k_argsort_f32_i32<GGML_SORT_ORDER_DESC>(
                        x, dst, ncols, ncols_pad, item_ct1,
                        dpct_local_acc_ct1.get_multi_ptr<sycl::access::decorated::no>()
                            .get());
                });
        });
    } else {
        GGML_ABORT("fatal error");
    }

    // Ensure all kernels finish execution before proceeding further
    stream->wait();  // This line was added to synchronize and wait for kernel completion
}
```
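
The suspected failure mode can be illustrated with a host-only toy model in standard C++. This is not SYCL and not the ggml code path: `ToyQueue`, `enqueue_fill_ids`, and `demo_read_before_and_after_wait` are hypothetical names, and the queue simply defers work until `wait()` is called. The sketch shows only that ids read before synchronization still hold a placeholder value and would fail the same range check.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an asynchronous in-order queue (NOT the SYCL API):
// submit() only records the work, wait() runs everything still pending.
struct ToyQueue {
    std::vector<std::function<void()>> pending;

    void submit(std::function<void()> task) { pending.push_back(std::move(task)); }

    void wait() {
        for (auto & task : pending) {
            task();
        }
        pending.clear();
    }
};

// Hypothetical "kernel": writes a valid expert id into every slot.
inline void enqueue_fill_ids(ToyQueue & q, std::vector<int> & ids, int n_as) {
    assert(n_as > 0);
    q.submit([&ids, n_as] {
        for (std::size_t i = 0; i < ids.size(); ++i) {
            ids[i] = static_cast<int>(i) % n_as;
        }
    });
}

// Same condition as the failing assertion, applied to the whole buffer.
inline bool all_ids_valid(const std::vector<int> & ids, int n_as) {
    for (int id : ids) {
        if (id < 0 || id >= n_as) {
            return false;
        }
    }
    return true;
}

// Returns {valid before wait(), valid after wait()}.
inline std::pair<bool, bool> demo_read_before_and_after_wait() {
    ToyQueue q;
    std::vector<int> ids(4, -1);  // -1 marks "not written yet"
    enqueue_fill_ids(q, ids, 8);
    const bool before = all_ids_valid(ids, 8);  // work has not run yet
    q.wait();                                   // plays the role of stream->wait()
    const bool after = all_ids_valid(ids, 8);
    return {before, after};
}
```

In this model the first element of the returned pair is `false` and the second is `true`. Whether the real crash follows the same pattern is an assumption that would need confirmation on the affected hardware.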



### First Bad Commit

_No response_

### Relevant log output

```shell
Command:
build-x64-windows-sycl-release-f16\bin\llama-simple.exe -m "D:\Download\OLMoE-1B-7B-0924.Q4_0.gguf" -ngl 99

Log Output:
llama_context: SYCL0 compute buffer size = 12.78 MiB
llama_context: SYCL_Host compute buffer size = 1.52 MiB
llama_context: graph nodes = 998
llama_context: graph splits = 2
Hello my name is
D:\download\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3381: GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as) failed
```

Eval bug: Asynchronous Kernel Execution on iGPU Causes Runtime Errors with MOE Model #15580
