Conversation


@coldav coldav commented Aug 26, 2025

This brings the native cpu compiler pipeline files directly under the ownership
of intel/llvm. This removes the direct dependence on the oneAPI Construction
Kit, although the history of those files still exists under intel/llvm and the
originals still exist at
https://github.com/uxlfoundation/oneapi-construction-kit.

This updates the CMakeLists.txt files to be appropriate for llvm components,
rather than in the style of the oneAPI Construction Kit. It also applies
clang-format-20 to all of the .h and .cpp files brought in. Other than that,
the .cpp and .h files are unchanged.

A brief explanation of the import of these files, and its limitations, can be found under llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_passes.rst.

It also turns NATIVECPU_USE_OCK on in sycl-linux-build.yml.

frasercrmck and others added 30 commits August 31, 2023 15:26
[compiler] Rename HandleBarriersPass -> WorkItemLoopsPass
These mirror the corresponding builtins in Intel's SPV_INTEL_subgroups
SPIR-V extension.

This commit also provides support for lowering to these via the SPIR-V
SubgroupShuffleINTEL capability, as part of the SPV_INTEL_subgroups
extension.

These builtins are currently not vectorized and thus only implemented
for 'trivial' sub-groups of size 1. Support for wider sub-groups will
come in later commits.
[compiler] Add mux sub-group shuffle builtins
This updates users of the vectorizer to set the desired vectorization
factor according to the kernel's required sub-group size, if present. It
does so using a common method, to ensure consistent behaviour. This
helper method assumes that the current underlying 'mux' sub-group size
is 1, as we haven't got any in-tree targets that make use of anything
larger than that.

All targets give the required sub-group size lower precedence than the
`optnone` attribute or the `-cl-wfv=never` option. This can be revisited
in the future if users desire different behaviour.
[compiler] Vectorize to any required sub-group size
These were missed when the pass was renamed.
This fixes a long-standing hack in the vectorizer where it would strip
the vectorized kernels of their 'kernel' status.

This wasn't anything to do with vectorization but was a legacy behaviour
owing to the fact that the vectorizer used to replace the old functions.
The `WorkItemLoopsPass` would in turn wrap every kernel it came across,
and we didn't want it wrapping up the old 'scalar' kernel since we'd
never call it: this would negatively impact compile time for no reason.

However, with the revised sub-groups model and to handle the case where
the target is not using degenerate sub-groups, we *do* want to wrap up
the unvectorized scalar kernel as it is useful as a fallback in the same
way the degenerate sub-group kernel is.

To avoid compiling unnecessary kernel wrappers, the WorkItemLoopsPass
has been taught to avoid wrapping scalar kernels if they don't use
sub-group builtins, or if they're degenerate, or if the kernel has a
required sub-group size. This should preserve the old behaviour in the
vast majority of cases.

The `VerifyReqdSubGroupSizeSatisfiedPass` must now be run later in the
pipeline, after the `WorkItemLoopsPass` has run. This is because
immediately after vectorization the old 'scalar' kernel appears as
though the required sub-group size has not been met. It is only once the
`WorkItemLoopsPass` has run and vectorized and unvectorized kernels have
been merged that we have our 'final' kernel forms, and we can glean
whether the compiler correctly satisfied the constraints.

We could alternatively make the pass look at all the vectorized forms of
each kernel and verify that *at least one* form of each kernel has met
the requirement, but this ties its behaviour more closely with the
vectorizer's output, whereas currently it's fairly agnostic: it runs
over all kernel entry points.
This changes how the riscv/refsi targets handle sub-group sizes. The
riscv/refsi targets now automatically vectorize kernels/functions
containing explicit use of sub-groups to one of the device's advertised
sub-group widths.

These targets no longer run the `DegenerateSubGroupPass`!

The rationale for this is as follows: The SYCL 2020 spec has
`info::device::sub_group_sizes` which returns "the" list of sub-group
sizes reported by the device. The CTS expects that if it runs a kernel
which (for example) returns the kernel's sub-group size to the host,
then that size is found in that same device list.

This rules out the use of degenerate sub-groups. We'd have to list all
the sub-group (read: work-group) sizes from 1 to 1024, which is useless
for users. Any of those sub-group sizes could then, in turn, be required
by a user, and the compiler would effectively be unable to satisfy them. A
user requiring a sub-group size of 244 would lead to a very strange
situation where the compiler must create a degenerate sub-groups kernel
and would then just hope that the user uses a work-group size of 244.
Anything else would be invalid. This is another bad user experience.

The vectorizer can be asked to return a suitable vectorization factor
for a kernel, given its use of sub-groups and the device's list of
sub-group sizes. Its heuristic is quite rudimentary and doesn't choose
between legal powers of two: the first legal size in the device's list
is taken as the preferred one.

I've filed a ticket for clarification/questions about sub-groups and
degenerate sub-groups against the SYCL spec, but I think this is a
suitable "workaround" for now, or even for the long term. The behaviour
is arguably more intuitive for users.
[riscv][vecz] Auto-vectorize sub-group kernels to a device-specific sub-group size
We were alternatively under- and over-setting the alignment on
vectorized loads. Under-setting the alignment is still correct if
sub-optimal.

However, the `SimplifyMaskedMemOpsPass` was ignoring the alignment of
the scalar masked mem load/store builtins when optimizing to regular
LLVM load and store instructions. This would over-set the alignment,
meaning the compiler could assume properties about the address which
were not true.

This led to riscv reporting a misaligned store exception on some SYCL
CTS tests.
[vecz] Fix alignment of load operations
The StrideAnalysisResult had its own local copy of an AssumptionCache.

These were still live at the time we deleted instructions at the end of
packetization, and/or tore down the vectorization process upon
packetization failure.

This meant that some llvm Values could have callback handles in two
assumption caches, and when one was deleted the other would hold onto
garbage memory.

The fix is to use the same assumption cache in StrideAnalysisResult as
is obtained from the FunctionAnalysisManager.

An additional bug was that we were ostensibly invalidating all function
analyses on deleted vectorized functions, but were accidentally
inverting the set of preserved analyses such that all analyses were
being *preserved*, rather than invalidated.
[vecz] Fix invalid dangling AssumptionCaches
This completes the set of work-group collective functions we can support
via the 'alternative' pathway through the vectorizer and the
work-item-loops pass. They join broadcasts and reductions, which were
added a while back.

The scans are implemented using a linear work-item loop, accumulating
with each iteration of the 'X' loop. The vectorizer vectorizes them like
the sub-group scans, so the accumulator used is just the
current/previous value of scan operation so far. The vectorized
work-group merges this with the partial scan over the vector group.

The 'scans-only' parameter to the replace-wgc pass has been removed. The
pass is either run or it isn't run, depending on what the target wants
to do to support the work-group collective operations. Only the RefSi G1
WIPT example uses the old pass, though it should still be tested in LIT.

In unofficial local testing, these scans appear to be around 10% faster
to complete the associated SYCL-CTS tests. This requires more
investigation, however.
[compiler] Add work-group scan support in vecz/work-item-loops
The inlining for memcpy was incorrectly using the destination
parameter's alignment rather than the source parameter, resulting
in miscompiles.

As well as fixing this issue, a test has been added.
[vecz] Use correct alignment for memcpy source

These builtins function as barriers, so masking them produces
undesirable results when laying out barrier regions. For this reason, we
don't mask 'regular' barriers.

These builtins are, in any case, uniform/convergent, so either all
work-items or no work-items should reach that point of execution.

Also ensure the spir-v reduction wrapper is marked as convergent, for
completeness' sake.
[vecz] Don't mask work-group collective operations
This extends fixed-width vectorization capabilities to
`__mux_sub_group_shuffle` builtins, but only those with uniform indices
(where the shuffle index is the same for all invocations in the
sub-group). This accounts for the majority of those tested by the
SYCL-CTS. Support for varying indices will come down the line, once the
other shuffles are covered under similar conditions.

The existing sub-group LIT tests have been split by operation, as they
are expected to grow significantly to cover all of the different
conditions we can vectorize under.
This extends fixed-width vectorization capabilities to
`__mux_sub_group_shuffle_xor` builtins. This isn't something that is
very efficiently vectorized, because of all of the runtime indexing,
which no built-in LLVM instructions/intrinsics can really make use of.

It might be preferable for some targets to go through memory. We might
want to make that a codegen option in a future update.
[vecz] Packetize sub-group shuffle_xor builtins
Forgetting to normalize the broadcast ID back into the 'mux' range would
produce incorrect results on an implementation where the mux sub-groups
were non-trivial, or where bounds were checked. This is, as such, not
currently a visible problem in the default implementation of these
builtins, but rather a theoretical one.
@coldav coldav force-pushed the colin/add_ock_into_dpcpp branch from 5f8574c to 729e7b5 Compare August 27, 2025 08:35
@coldav coldav requested a review from a team as a code owner August 27, 2025 08:35
We were generating undef in many places because in older versions of
LLVM, poison did not yet exist. As we no longer support those older
versions of LLVM, and current LLVM is deprecating undef, switch over to
poison.
hvdijk and others added 6 commits August 27, 2025 12:04
[NFC] Update to clang-format/clang-tidy 20.
We were using 'and' to filter branch conditions, trusting that if an
entry mask had it set to false, it would assuredly be kept false. This
does not work when a branch condition is computed as 'poison', as 'false
and poison' evaluates to 'poison', not 'false'.

We can generate 'select' instructions instead, which avoid this problem.
Further updates to account for the deprecation of undef: this updates
documentation and variable names in line with the previous update to use
poison more. It updates tests to no longer use undef where we would no
longer generate undef. It restores one use of UndefValue::get where the
replacement with PoisonValue::get was wrong: SPIR-V's OpUndef does not
allow treating it as poison.
This brings the native cpu compiler pipeline files directly under the ownership
of intel/llvm. This removes the direct dependence on the oneAPI Construction
Kit, although the history of those files still exists under intel/llvm and the
originals still exist at
https://github.com/uxlfoundation/oneapi-construction-kit.

This is the post-merge update to the oneAPI Construction Kit move of the compiler
pipeline files for Native CPU, and it replaces the original FetchContent method.

This updates the CMakeLists.txt files to be appropriate for llvm components,
rather than in the style of the oneAPI Construction Kit. It also applies
clang-format-20 to all of the .h and .cpp files brought in. Other than that,
the .cpp and .h files are unchanged.

A brief explanation of the import of these files, and its limitations, can be found under llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_passes.rst.

It also turns NATIVECPU_USE_OCK on in sycl-linux-build.yml.