DRAFT: NOT FOR REVIEW [NATIVE_CPU][SYCL] Switch to using native_cpu compiler pipeline inline instead of OCK fetchContent #19886
Open: coldav wants to merge 316 commits into intel:sycl from coldav:colin/add_ock_into_dpcpp
Conversation
[compiler] Rename HandleBarriersPass -> WorkItemLoopsPass
[compiler] Add mux sub-group shuffle builtins

These mirror the corresponding builtins in Intel's SPV_INTEL_subgroups SPIR-V extension. This commit also provides support for lowering to these via the SPIR-V SubgroupShuffleINTEL capability, as part of the SPV_INTEL_subgroups extension. These builtins are currently not vectorized and thus only implemented for 'trivial' sub-groups of size 1. Support for wider sub-groups will come in later commits.
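With a trivial sub-group of size 1, the only valid shuffle source is the invocation itself, so the shuffle is the identity. A minimal sketch of what that lowering could look like; the helper name is illustrative, not the in-tree API:

```cpp
// Hypothetical sketch: with a 'mux' sub-group size of 1, a shuffle can only
// read the invocation's own lane, so shuffle(x, id) == x.
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Replace every call to a size-1 sub-group shuffle with its data operand.
static void lowerTrivialShuffles(Function &ShuffleBuiltin) {
  for (User *U : make_early_inc_range(ShuffleBuiltin.users())) {
    if (auto *CB = dyn_cast<CallBase>(U)) {
      CB->replaceAllUsesWith(CB->getArgOperand(0)); // identity shuffle
      CB->eraseFromParent();
    }
  }
}
```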
[compiler] Vectorize to any required sub-group size

This updates users of the vectorizer to set the desired vectorization factor according to the kernel's required sub-group size, if present. It does so using a common helper method, to ensure consistent behaviour. This helper assumes that the current underlying 'mux' sub-group size is 1, as no in-tree target makes use of anything larger. All targets give the required sub-group size lower priority than the `optnone` attribute or the `-cl-wfv=never` option. This can be revisited in the future if users desire different behaviour.
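A hedged sketch of the shape of that common helper, under the stated assumption of a mux sub-group size of 1; the function name and signature are illustrative:

```cpp
// With a mux sub-group size of 1, the vectorization factor needed to satisfy
// a required sub-group size is simply that size divided by the mux size.
#include <cstdint>
#include <optional>

std::optional<uint32_t>
getReqdSubGroupVF(std::optional<uint32_t> ReqdSubGroupSize,
                  uint32_t MuxSubGroupSize = 1) {
  if (!ReqdSubGroupSize || *ReqdSubGroupSize % MuxSubGroupSize != 0)
    return std::nullopt; // requirement cannot be met by widening
  return *ReqdSubGroupSize / MuxSubGroupSize;
}
```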
These were missed when the pass was renamed.
This fixes a long-standing hack in the vectorizer where it would strip vectorized kernels of their 'kernel' status. This had nothing to do with vectorization; it was legacy behaviour from the days when the vectorizer replaced the old functions outright. The `WorkItemLoopsPass` would in turn wrap every kernel it came across, and we didn't want it wrapping the old 'scalar' kernel since we'd never call it: that would hurt compile time for no benefit.

However, with the revised sub-groups model, and to handle the case where the target is not using degenerate sub-groups, we *do* want to wrap the unvectorized scalar kernel: it is useful as a fallback in the same way the degenerate sub-group kernel is. To avoid compiling unnecessary kernel wrappers, the `WorkItemLoopsPass` has been taught to avoid wrapping scalar kernels if they don't use sub-group builtins, if they're degenerate, or if the kernel has a required sub-group size (see the sketch below). This should preserve the old behaviour in the vast majority of cases.

The `VerifyReqdSubGroupSizeSatisfiedPass` must now run later in the pipeline, after the `WorkItemLoopsPass`. Immediately after vectorization, the old 'scalar' kernel appears as though the required sub-group size has not been met; only once the `WorkItemLoopsPass` has run and the vectorized and unvectorized kernels have been merged do we have our 'final' kernel forms and can tell whether the compiler correctly satisfied the constraints. We could alternatively make the pass look at all the vectorized forms of each kernel and verify that *at least one* form meets the requirement, but this ties its behaviour more closely to the vectorizer's output, whereas currently it's fairly agnostic: it runs over all kernel entry points.
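An illustrative restatement of the wrapping rule described above; the real pass queries kernel metadata, and these booleans stand in for those queries:

```cpp
// Sketch of the decision: the scalar kernel is only worth keeping as a
// fallback when sub-groups are real (non-degenerate), actually used, and
// no exact sub-group size is required of the kernel.
bool shouldWrapScalarKernel(bool UsesSubGroupBuiltins,
                            bool DegenerateSubGroups,
                            bool HasReqdSubGroupSize) {
  return UsesSubGroupBuiltins && !DegenerateSubGroups && !HasReqdSubGroupSize;
}
```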
[riscv][vecz] Auto-vectorize sub-group kernels to a device-specific sub-group size

This changes how the riscv/refsi targets handle sub-group sizes. They now automatically vectorize kernels/functions containing explicit use of sub-groups to one of the device's advertised sub-group widths, and no longer run the `DegenerateSubGroupPass`. The rationale is as follows: the SYCL 2020 spec has `info::device::sub_group_sizes`, which returns "the" list of sub-group sizes reported by the device. The CTS expects that if it runs a kernel which (for example) returns the kernel's sub-group size to the host, then that size is found in that same device list. This rules out the use of degenerate sub-groups: we'd have to list all the sub-group (read: work-group) sizes from 1 to 1024, which is useless for users. Any of those sub-group sizes could then, in turn, be required by a user, and the compiler would effectively be unable to satisfy them all. A user requiring a sub-group size of 244 would lead to a very strange situation where the compiler must create a degenerate sub-groups kernel and then just hope that the user uses a work-group size of 244; anything else would be invalid. This is another bad user experience.

The vectorizer can now be asked to return a suitable vectorization factor for a kernel, given its use of sub-groups and the device's list of sub-group sizes. The heuristic is quite rudimentary and doesn't choose between legal powers of two: the first suitable size in the list is taken as the preferred one. I've filed a ticket for clarification/questions about sub-groups and degenerate sub-groups against the SYCL spec, but I think this is a suitable "workaround" for now, or even for the long term. The behaviour is arguably more intuitive for users.
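A minimal sketch of the rudimentary heuristic as described: the first advertised size the vectorizer can realize wins, with no ranking of the remaining legal powers of two. Names and the legality check are illustrative assumptions:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Pick a vectorization factor from the device's advertised sub-group sizes.
std::optional<uint32_t>
pickSubGroupVF(const std::vector<uint32_t> &DeviceSubGroupSizes,
               uint32_t MaxLegalWidth) {
  for (uint32_t Size : DeviceSubGroupSizes)
    if (Size <= MaxLegalWidth)
      return Size; // first feasible size is taken as preferred
  return std::nullopt;
}
```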
[vecz] Fix alignment of load operations

We were alternatively under- and over-setting the alignment on vectorized loads. Under-setting the alignment is still correct, if sub-optimal. However, the `SimplifyMaskedMemOpsPass` was ignoring the alignment of the scalar masked mem load/store builtins when optimizing them to regular LLVM load and store instructions. This would over-set the alignment, meaning the compiler could assume properties about the address which were not true. This led to riscv reporting a misaligned store exception on some SYCL CTS tests.
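A hedged sketch of the shape of the fix: when rewriting a masked-load builtin to a plain load, carry over the alignment the builtin recorded rather than assuming the vector type's natural alignment. The function is illustrative, not the pass's actual code:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Rewrite a masked load to a plain load using the builtin's own alignment.
Value *rewriteMaskedLoad(IRBuilder<> &B, Type *VecTy, Value *Ptr,
                         Align BuiltinAlign) {
  // Using the vector type's store size here would over-set the alignment
  // and let later passes assume address properties that are not true.
  return B.CreateAlignedLoad(VecTy, Ptr, BuiltinAlign);
}
```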
[vecz] Fix invalid dangling AssumptionCaches

The StrideAnalysisResult had its own local copy of an AssumptionCache. These were still live at the time we deleted instructions at the end of packetization, and/or tore down the vectorization process upon packetization failure. This meant that some LLVM Values could have callback handles in two assumption caches, and when one was deleted the other would hold onto garbage memory. The fix is to use the same assumption cache in StrideAnalysisResult as is obtained from the FunctionAnalysisManager. An additional bug was that we were ostensibly invalidating all function analyses on deleted vectorized functions, but were accidentally inverting the set of preserved analyses such that all analyses were being *preserved*, rather than invalidated.
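A sketch of both fixes using the standard new-pass-manager APIs; the surrounding function is illustrative:

```cpp
#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

void cleanUpVectorizedFunction(Function &VecF, FunctionAnalysisManager &FAM) {
  // Fix 1: share the FAM's AssumptionCache rather than owning a local copy,
  // so a Value never holds callback handles in two caches at once.
  AssumptionCache &AC = FAM.getResult<AssumptionAnalysis>(VecF);
  (void)AC; // handed to the stride analysis in the real code

  // Fix 2: invalidate *all* analyses on the function about to be deleted.
  // Passing PreservedAnalyses::all() here was the accidental inversion that
  // kept every stale analysis alive instead.
  FAM.invalidate(VecF, PreservedAnalyses::none());
}
```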
[compiler] Add work-group scan support in vecz/work-item-loops

This completes the set of work-group collective functions we can support via the 'alternative' pathway through the vectorizer and the work-item-loops pass. They join broadcasts and reductions, which were added a while back. The scans are implemented using a linear work-item loop, accumulating with each iteration of the 'X' loop (see the sketch below). The vectorizer vectorizes them like the sub-group scans, so the accumulator used is just the value of the scan operation so far; the vectorized work-group scan merges this with the partial scan over the vector group. The 'scans-only' parameter to the replace-wgc pass has been removed: the pass is either run or not run, depending on what the target wants to do to support the work-group collective operations. Only the RefSi G1 WIPT example uses the old pass, though it should still be tested in LIT. In unofficial local testing, the associated SYCL-CTS tests complete around 10% faster with these scans. This requires more investigation, however.
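A minimal sketch of the lowering shape for an inclusive add scan over the 'X' dimension, written as plain C++ for clarity; the real codegen emits this pattern as IR around the kernel body, with vecz folding in the partial scan over each vector group:

```cpp
#include <cstddef>

// Inclusive work-group scan (add) realised as a linear work-item loop.
void workGroupScanAdd(const int *in, int *out, size_t groupSizeX) {
  int acc = 0; // running value of the scan so far
  for (size_t x = 0; x < groupSizeX; ++x) { // the linear 'X' loop
    acc += in[x];  // accumulate this work-item's contribution
    out[x] = acc;  // inclusive scan result for work-item x
  }
}
```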
[vecz] Use correct alignment for memcpy source

The inlining for memcpy was incorrectly using the destination parameter's alignment rather than the source parameter's, resulting in miscompiles. As well as fixing this issue, a test has been added.
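The essence of the fix, as a hedged sketch: loads emitted for the inlined copy must take their alignment from the memcpy *source* operand. The helper is illustrative; the `MemCpyInst` accessors are the real LLVM API:

```cpp
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Alignment for the loads of an inlined memcpy: read side, not write side.
Align getLoadAlignForInlinedMemCpy(const MemCpyInst &MC) {
  // getSourceAlign(), not getDestAlign(): the loads read from the source.
  return MC.getSourceAlign().valueOrOne();
}
```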
[vecz] Don't mask work-group collective operations

These builtins function as barriers, so masking them produces undesirable results when laying out barrier regions. For this reason, we don't mask 'regular' barriers. These builtins are, in any case, uniform/convergent, so either all work-items or no work-items should reach that point of execution. Also ensure the spir-v reduction wrapper is marked as convergent, for completeness' sake.
This extends fixed-width vectorization capabilities to `__mux_sub_group_shuffle` builtins, but only those with uniform indices (where the shuffle index is the same for all invocations in the sub-group). This accounts for the majority of those tested by the SYCL-CTS. Support for varying indices will come down the line, once the other shuffles are covered under similar conditions. The existing sub-group LIT tests have been split by operation, as they are expected to grow significantly to cover all of the different conditions we can vectorize under.
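A hedged sketch of the uniform-index case: if every invocation in the sub-group shuffles from the same lane, the packetized form is an extract of that lane splatted across the vector. The IRBuilder calls are the real API; the function and its parameters are illustrative:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Packetize shuffle(data, idx) when idx is uniform across the sub-group.
Value *packetizeUniformShuffle(IRBuilder<> &B, Value *PackedData,
                               Value *UniformIdx, ElementCount VF) {
  Value *Lane = B.CreateExtractElement(PackedData, UniformIdx);
  return B.CreateVectorSplat(VF, Lane); // same value for every invocation
}
```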
[vecz] Packetize sub-group shuffle_xor builtins

This extends fixed-width vectorization capabilities to `__mux_sub_group_shuffle_xor` builtins. This isn't something that is very efficiently vectorized, because of all of the runtime indexing, which no built-in LLVM instructions/intrinsics can really make use of. It might be preferable for some targets to go through memory. We might want to make that a codegen option in a future update.
Forgetting to normalize the broadcast ID back into the 'mux' range would produce incorrect results on an implementation where the mux sub-groups were non-trivial, or where bounds were checked. This is, as such, not currently a visible problem in the default implementation of these builtins, but rather a theoretical one.
coldav force-pushed from 5f8574c to 729e7b5
We were generating undef in many places because in older versions of LLVM, poison did not yet exist. As we no longer support those older versions of LLVM, and current LLVM is deprecating undef, switch over to poison.
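The mechanical shape of the change, as a tiny sketch; the helper is illustrative, the constant classes are the real LLVM API:

```cpp
#include "llvm/IR/Constants.h"

using namespace llvm;

// Placeholder values now use poison rather than the deprecated undef.
Value *makePlaceholder(Type *Ty) {
  return PoisonValue::get(Ty); // was: UndefValue::get(Ty)
}
```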
[NFC] Update to clang-format/clang-tidy 20.
We were using 'and' to filter branch conditions, trusting that wherever the entry mask was false, the result would assuredly be kept false. This does not work when a branch condition is computed as 'poison', as 'false and poison' evaluates to 'poison', not 'false'. We can generate 'select' instructions instead, which avoid this problem.
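Why select is safe where and is not: `and false, poison` folds to poison because poison propagates through binary operators, whereas a select with a well-defined false condition never reads its poisoned arm. A sketch of the replacement, with an illustrative helper name:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Combine an entry mask with a (possibly poison) branch condition.
Value *combineMasks(IRBuilder<> &B, Value *EntryMask, Value *BranchCond) {
  // was: B.CreateAnd(EntryMask, BranchCond) — poison leaked through false.
  return B.CreateSelect(EntryMask, BranchCond, B.getFalse());
}
```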
More poison/undef updates.

Further updates to account for the deprecation of undef: this updates documentation and variable names in line with the previous update to use poison more. It updates tests to no longer use undef where we would no longer generate undef. It restores one use of UndefValue::get where the replacement with PoisonValue::get was wrong: SPIR-V's OpUndef does not allow treating it as poison.
coldav force-pushed from 729e7b5 to 63d7eb8
This brings the Native CPU compiler pipeline files directly under the ownership of intel/llvm. This removes the direct dependence on the oneAPI Construction Kit, although the history of those files still exists under intel/llvm and the originals still exist at https://github.com/uxlfoundation/oneapi-construction-kit. This is the post-merge update to the oneAPI Construction Kit move of the compiler pipeline files for Native CPU, and replaces the original FetchContent method. It updates the CMakeLists.txt to be appropriate for llvm components, rather than in the style of the oneAPI Construction Kit. It also applies clang-format-20 to all the .h and .cpp files brought in; other than that, the .cpp and .h files are unchanged. A brief explanation of the import and its limitations is under llvm/lib/SYCLNativeCPUUtils/compiler_passes/compiler_passes.rst. It also updates NATIVECPU_USE_OCK to be on in sycl-linux-build.yml.
coldav force-pushed from 63d7eb8 to 0b99891