DRAFT: NOT FOR REVIEW [NATIVE_CPU][SYCL] Switch to using native_cpu compiler pipeline inline instead of OCK fetchContent #19886
Open: coldav wants to merge 316 commits into intel:sycl from coldav:colin/add_ock_into_dpcpp
Conversation
[compiler] Rename HandleBarriersPass -> WorkItemLoopsPass
[compiler] Add mux sub-group shuffle builtins

These mirror the corresponding builtins in Intel's SPV_INTEL_subgroups SPIR-V extension. This commit also provides support for lowering to these via the SPIR-V SubgroupShuffleINTEL capability, as part of the SPV_INTEL_subgroups extension. These builtins are currently not vectorized and thus only implemented for 'trivial' sub-groups of size 1. Support for wider sub-groups will come in later commits.
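With a trivial sub-group of size 1, the only valid shuffle source is the invocation itself, so the shuffle is the identity. A minimal sketch of what that lowering could look like; the helper name is illustrative, not the in-tree API:

```cpp
// Hypothetical sketch: with a 'mux' sub-group size of 1, a shuffle can only
// read the invocation's own lane, so shuffle(x, id) == x.
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Replace every call to a size-1 sub-group shuffle with its data operand.
static void lowerTrivialShuffles(Function &ShuffleBuiltin) {
  for (User *U : make_early_inc_range(ShuffleBuiltin.users())) {
    if (auto *CB = dyn_cast<CallBase>(U)) {
      CB->replaceAllUsesWith(CB->getArgOperand(0)); // identity shuffle
      CB->eraseFromParent();
    }
  }
}
```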
[compiler] Vectorize to any required sub-group size

This updates users of the vectorizer to set the desired vectorization factor according to the kernel's required sub-group size, if present. It does so using a common helper method, to ensure consistent behaviour. This helper assumes that the current underlying 'mux' sub-group size is 1, as no in-tree target makes use of anything larger. All targets give the required sub-group size lower priority than the `optnone` attribute or the `-cl-wfv=never` option. This can be revisited in the future if users desire different behaviour.
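A hedged sketch of the shape of that common helper, under the stated assumption of a mux sub-group size of 1; the function name and signature are illustrative:

```cpp
// With a mux sub-group size of 1, the vectorization factor needed to satisfy
// a required sub-group size is simply that size divided by the mux size.
#include <cstdint>
#include <optional>

std::optional<uint32_t>
getReqdSubGroupVF(std::optional<uint32_t> ReqdSubGroupSize,
                  uint32_t MuxSubGroupSize = 1) {
  if (!ReqdSubGroupSize || *ReqdSubGroupSize % MuxSubGroupSize != 0)
    return std::nullopt; // requirement cannot be met by widening
  return *ReqdSubGroupSize / MuxSubGroupSize;
}
```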
These were missed when the pass was renamed.
This fixes a long-standing hack in the vectorizer where it would strip vectorized kernels of their 'kernel' status. This had nothing to do with vectorization; it was legacy behaviour from the days when the vectorizer replaced the old functions outright. The `WorkItemLoopsPass` would in turn wrap every kernel it came across, and we didn't want it wrapping the old 'scalar' kernel since we'd never call it: that would hurt compile time for no benefit.

However, with the revised sub-groups model, and to handle the case where the target is not using degenerate sub-groups, we *do* want to wrap the unvectorized scalar kernel: it is useful as a fallback in the same way the degenerate sub-group kernel is. To avoid compiling unnecessary kernel wrappers, the `WorkItemLoopsPass` has been taught to avoid wrapping scalar kernels if they don't use sub-group builtins, if they're degenerate, or if the kernel has a required sub-group size (see the sketch below). This should preserve the old behaviour in the vast majority of cases.

The `VerifyReqdSubGroupSizeSatisfiedPass` must now run later in the pipeline, after the `WorkItemLoopsPass`. Immediately after vectorization, the old 'scalar' kernel appears as though the required sub-group size has not been met; only once the `WorkItemLoopsPass` has run and the vectorized and unvectorized kernels have been merged do we have our 'final' kernel forms and can tell whether the compiler correctly satisfied the constraints. We could alternatively make the pass look at all the vectorized forms of each kernel and verify that *at least one* form meets the requirement, but this ties its behaviour more closely to the vectorizer's output, whereas currently it's fairly agnostic: it runs over all kernel entry points.
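An illustrative restatement of the wrapping rule described above; the real pass queries kernel metadata, and these booleans stand in for those queries:

```cpp
// Sketch of the decision: the scalar kernel is only worth keeping as a
// fallback when sub-groups are real (non-degenerate), actually used, and
// no exact sub-group size is required of the kernel.
bool shouldWrapScalarKernel(bool UsesSubGroupBuiltins,
                            bool DegenerateSubGroups,
                            bool HasReqdSubGroupSize) {
  return UsesSubGroupBuiltins && !DegenerateSubGroups && !HasReqdSubGroupSize;
}
```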
[riscv][vecz] Auto-vectorize sub-group kernels to a device-specific sub-group size

This changes how the riscv/refsi targets handle sub-group sizes. They now automatically vectorize kernels/functions containing explicit use of sub-groups to one of the device's advertised sub-group widths, and no longer run the `DegenerateSubGroupPass`. The rationale is as follows: the SYCL 2020 spec has `info::device::sub_group_sizes`, which returns "the" list of sub-group sizes reported by the device. The CTS expects that if it runs a kernel which (for example) returns the kernel's sub-group size to the host, then that size is found in that same device list. This rules out the use of degenerate sub-groups: we'd have to list all the sub-group (read: work-group) sizes from 1 to 1024, which is useless for users. Any of those sub-group sizes could then, in turn, be required by a user, and the compiler would effectively be unable to satisfy them all. A user requiring a sub-group size of 244 would lead to a very strange situation where the compiler must create a degenerate sub-groups kernel and then just hope that the user uses a work-group size of 244; anything else would be invalid. This is another bad user experience.

The vectorizer can now be asked to return a suitable vectorization factor for a kernel, given its use of sub-groups and the device's list of sub-group sizes. The heuristic is quite rudimentary and doesn't choose between legal powers of two: the first suitable size in the list is taken as the preferred one. I've filed a ticket for clarification/questions about sub-groups and degenerate sub-groups against the SYCL spec, but I think this is a suitable "workaround" for now, or even for the long term. The behaviour is arguably more intuitive for users.
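A minimal sketch of the rudimentary heuristic as described: the first advertised size the vectorizer can realize wins, with no ranking of the remaining legal powers of two. Names and the legality check are illustrative assumptions:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Pick a vectorization factor from the device's advertised sub-group sizes.
std::optional<uint32_t>
pickSubGroupVF(const std::vector<uint32_t> &DeviceSubGroupSizes,
               uint32_t MaxLegalWidth) {
  for (uint32_t Size : DeviceSubGroupSizes)
    if (Size <= MaxLegalWidth)
      return Size; // first feasible size is taken as preferred
  return std::nullopt;
}
```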
[vecz] Fix alignment of load operations

We were alternatively under- and over-setting the alignment on vectorized loads. Under-setting the alignment is still correct, if sub-optimal. However, the `SimplifyMaskedMemOpsPass` was ignoring the alignment of the scalar masked mem load/store builtins when optimizing them to regular LLVM load and store instructions. This would over-set the alignment, meaning the compiler could assume properties about the address which were not true. This led to riscv reporting a misaligned store exception on some SYCL CTS tests.
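A hedged sketch of the shape of the fix: when rewriting a masked-load builtin to a plain load, carry over the alignment the builtin recorded rather than assuming the vector type's natural alignment. The function is illustrative, not the pass's actual code:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Rewrite a masked load to a plain load using the builtin's own alignment.
Value *rewriteMaskedLoad(IRBuilder<> &B, Type *VecTy, Value *Ptr,
                         Align BuiltinAlign) {
  // Using the vector type's store size here would over-set the alignment
  // and let later passes assume address properties that are not true.
  return B.CreateAlignedLoad(VecTy, Ptr, BuiltinAlign);
}
```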
[vecz] Fix invalid dangling AssumptionCaches

The StrideAnalysisResult had its own local copy of an AssumptionCache. These were still live at the time we deleted instructions at the end of packetization, and/or tore down the vectorization process upon packetization failure. This meant that some LLVM Values could have callback handles in two assumption caches, and when one was deleted the other would hold onto garbage memory. The fix is to use the same assumption cache in StrideAnalysisResult as is obtained from the FunctionAnalysisManager. An additional bug was that we were ostensibly invalidating all function analyses on deleted vectorized functions, but were accidentally inverting the set of preserved analyses such that all analyses were being *preserved*, rather than invalidated.
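A sketch of both fixes using the standard new-pass-manager APIs; the surrounding function is illustrative:

```cpp
#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

void cleanUpVectorizedFunction(Function &VecF, FunctionAnalysisManager &FAM) {
  // Fix 1: share the FAM's AssumptionCache rather than owning a local copy,
  // so a Value never holds callback handles in two caches at once.
  AssumptionCache &AC = FAM.getResult<AssumptionAnalysis>(VecF);
  (void)AC; // handed to the stride analysis in the real code

  // Fix 2: invalidate *all* analyses on the function about to be deleted.
  // Passing PreservedAnalyses::all() here was the accidental inversion that
  // kept every stale analysis alive instead.
  FAM.invalidate(VecF, PreservedAnalyses::none());
}
```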
[compiler] Add work-group scan support in vecz/work-item-loops

This completes the set of work-group collective functions we can support via the 'alternative' pathway through the vectorizer and the work-item-loops pass. They join broadcasts and reductions, which were added a while back. The scans are implemented using a linear work-item loop, accumulating with each iteration of the 'X' loop (see the sketch below). The vectorizer vectorizes them like the sub-group scans, so the accumulator used is just the value of the scan operation so far; the vectorized work-group scan merges this with the partial scan over the vector group. The 'scans-only' parameter to the replace-wgc pass has been removed: the pass is either run or not run, depending on what the target wants to do to support the work-group collective operations. Only the RefSi G1 WIPT example uses the old pass, though it should still be tested in LIT. In unofficial local testing, the associated SYCL-CTS tests complete around 10% faster with these scans. This requires more investigation, however.
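A minimal sketch of the lowering shape for an inclusive add scan over the 'X' dimension, written as plain C++ for clarity; the real codegen emits this pattern as IR around the kernel body, with vecz folding in the partial scan over each vector group:

```cpp
#include <cstddef>

// Inclusive work-group scan (add) realised as a linear work-item loop.
void workGroupScanAdd(const int *in, int *out, size_t groupSizeX) {
  int acc = 0; // running value of the scan so far
  for (size_t x = 0; x < groupSizeX; ++x) { // the linear 'X' loop
    acc += in[x];  // accumulate this work-item's contribution
    out[x] = acc;  // inclusive scan result for work-item x
  }
}
```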
[vecz] Use correct alignment for memcpy source

The inlining for memcpy was incorrectly using the destination parameter's alignment rather than the source parameter's, resulting in miscompiles. As well as fixing this issue, a test has been added.
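The essence of the fix, as a hedged sketch: loads emitted for the inlined copy must take their alignment from the memcpy *source* operand. The helper is illustrative; the `MemCpyInst` accessors are the real LLVM API:

```cpp
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Alignment for the loads of an inlined memcpy: read side, not write side.
Align getLoadAlignForInlinedMemCpy(const MemCpyInst &MC) {
  // getSourceAlign(), not getDestAlign(): the loads read from the source.
  return MC.getSourceAlign().valueOrOne();
}
```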
[vecz] Don't mask work-group collective operations

These builtins function as barriers, so masking them produces undesirable results when laying out barrier regions. For this reason, we don't mask 'regular' barriers. These builtins are, in any case, uniform/convergent, so either all work-items or no work-items should reach that point of execution. Also ensure the spir-v reduction wrapper is marked as convergent, for completeness' sake.
This extends fixed-width vectorization capabilities to `__mux_sub_group_shuffle` builtins, but only those with uniform indices (where the shuffle index is the same for all invocations in the sub-group). This accounts for the majority of those tested by the SYCL-CTS. Support for varying indices will come down the line, once the other shuffles are covered under similar conditions. The existing sub-group LIT tests have been split by operation, as they are expected to grow significantly to cover all of the different conditions we can vectorize under.
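A hedged sketch of the uniform-index case: if every invocation in the sub-group shuffles from the same lane, the packetized form is an extract of that lane splatted across the vector. The IRBuilder calls are the real API; the function and its parameters are illustrative:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Packetize shuffle(data, idx) when idx is uniform across the sub-group.
Value *packetizeUniformShuffle(IRBuilder<> &B, Value *PackedData,
                               Value *UniformIdx, ElementCount VF) {
  Value *Lane = B.CreateExtractElement(PackedData, UniformIdx);
  return B.CreateVectorSplat(VF, Lane); // same value for every invocation
}
```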
[vecz] Packetize sub-group shuffle_xor builtins

This extends fixed-width vectorization capabilities to `__mux_sub_group_shuffle_xor` builtins. This isn't something that is very efficiently vectorized, because of all of the runtime indexing, which no built-in LLVM instructions/intrinsics can really make use of. It might be preferable for some targets to go through memory. We might want to make that a codegen option in a future update.
Forgetting to normalize the broadcast ID back into the 'mux' range would produce incorrect results on an implementation where the mux sub-groups were non-trivial, or where bounds were checked. This is, as such, not currently a visible problem in the default implementation of these builtins, but rather a theoretical one.
coldav force-pushed from 5f8574c to 729e7b5
We were generating undef in many places because in older versions of LLVM, poison did not yet exist. As we no longer support those older versions of LLVM, and current LLVM is deprecating undef, switch over to poison.
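The mechanical shape of the change, as a tiny sketch; the helper is illustrative, the constant classes are the real LLVM API:

```cpp
#include "llvm/IR/Constants.h"

using namespace llvm;

// Placeholder values now use poison rather than the deprecated undef.
Value *makePlaceholder(Type *Ty) {
  return PoisonValue::get(Ty); // was: UndefValue::get(Ty)
}
```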
[NFC] Update to clang-format/clang-tidy 20.
We were using 'and' to filter branch conditions, trusting that wherever the entry mask was false, the result would assuredly be kept false. This does not work when a branch condition is computed as 'poison', as 'false and poison' evaluates to 'poison', not 'false'. We can generate 'select' instructions instead, which avoid this problem.
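Why select is safe where and is not: `and false, poison` folds to poison because poison propagates through binary operators, whereas a select with a well-defined false condition never reads its poisoned arm. A sketch of the replacement, with an illustrative helper name:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Combine an entry mask with a (possibly poison) branch condition.
Value *combineMasks(IRBuilder<> &B, Value *EntryMask, Value *BranchCond) {
  // was: B.CreateAnd(EntryMask, BranchCond) — poison leaked through false.
  return B.CreateSelect(EntryMask, BranchCond, B.getFalse());
}
```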
More poison/undef updates.

Further updates to account for the deprecation of undef: this updates documentation and variable names in line with the previous update to use poison more. It updates tests to no longer use undef where we would no longer generate undef. It restores one use of UndefValue::get where the replacement with PoisonValue::get was wrong: SPIR-V's OpUndef does not allow treating it as poison.
coldav force-pushed from 729e7b5 to 63d7eb8
This brings the Native CPU compiler pipeline files directly under the ownership of intel/llvm. This removes the direct dependence on the oneAPI Construction Kit, although the history of those files still exists under intel/llvm and the originals still exist at https://github.com/uxlfoundation/oneapi-construction-kit. This is the post-merge update to the oneAPI Construction Kit move of the compiler pipeline files for Native CPU, and replaces the original FetchContent method. It updates the CMakeLists.txt to be appropriate for llvm components, rather than in the style of the oneAPI Construction Kit. It also applies clang-format-20 to all the .h and .cpp files brought in; other than that, the .cpp and .h files are unchanged. A brief explanation of the import and its limitations is under llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_passes.rst. It also updates NATIVECPU_USE_OCK to be on in sycl-linux-build.yml.
coldav force-pushed from 63d7eb8 to 0b99891