forked from llvm/llvm-project
merge amd-staging into amd-feature/wave-transform #348
Merged
cdevadas
merged 973 commits into
amd-feature/wave-transform
from
amd/wave-transform/merge-oct22
Oct 23, 2025
Conversation
…vm#161637) The check used was not strong enough to prevent the insertion of sample/bvhcnt waits when they were not needed. I assume SIInsertWaitCnts was trimming those away anyway, but this was a bug nonetheless. We were inserting SAMPLE/BVHcnt waits in places where we only needed to wait on the previous atomic operation. Neither of these counters has any atomics associated with them.
When the legacy cost model scalarizes loads that are used as addresses for other loads and stores, it also considers phi nodes, if they are direct address operands of loads/stores. Match this behavior in isUsedByLoadStoreAddress to fix a divergence between the legacy and VPlan-based cost models.
No tests seem to need them here or downstream.
…lvm#163570) Splats include two poison values, but only the poison-ness of the splatted value actually matters.
A single-scalar replicate without side-effects, and with uniform operands, is uniform. Special-case assumes and stores.
…4051) There are cases where atomic constraints are independent of template parameters, yet we still have a template parameter mapping. We don't bother translating template arguments for them. Note that we retain an empty parameter mapping rather than none at all, as the former may improve cache hit rates (We don't profile MLTAL but profile the empty template argument list instead.) This is a regression on trunk, so there's no release note.
When serializing SPIR-V MLIR containing an externally linked function with debug enabled, the serialization crashes as `printValueIDMap` tries to print a key value that has already been destroyed. This happens because, for externally linked functions, the body of the function is erased, which causes its arguments to be destroyed as well, while the valueIDMap was never updated.
…vm#164198) createWidenCast doesn't set the flag type, so when we simplify trunc (zext nneg x) -> zext x we would hit an assertion in CSE that the flag types don't match with other VPWidenCastRecipes that weren't simplified. This fixes it the same way trunc flags are handled too. As an aside I think it should be correct to preserve the nneg flag in this case since the input operand is still non-negative after the transform. But that's left to another PR. Fixes llvm#164171
Recipes in replicate regions implicitly depend on the region's predicate. Limit CSE to recipes in the same block, when either recipe is in a replicate region. This allows handling VPPredInstPHIRecipe during CSE. If we perform CSE on recipes inside a replicate region, we may end up with 2 VPPredInstPHIRecipes sharing the same operand. This is incompatible with current VPPredInstPHIRecipe codegen, which re-sets the current value of its operand in VPTransformState. This can cause crashes in the added test cases. Note that this patch only modifies ::isEqual to check for replicating regions and not getHash, as CSE across replicating regions should be uncommon. Fixes llvm#157314. Fixes llvm#161974. PR: llvm#162110
…64204) One iteration of this loop might have already fixed up the pointers of upcoming globals, so check for that explicitly. Fixes llvm#164151
- typos
- use insert instead of back inserter to make allocations more efficient
- make sure the constraint cache uses canonicalized keys to avoid redundant work
I was a bit too eager to remove the SI_WHOLE_WAVE_FUNC_SETUP instruction during prolog emission. Erasing it invalidates MBBI, which in some cases is still needed outside of `emitCSRSpillStores`. Do the erasing at the end of prolog insertion instead.
Part of llvm#102817. This patch attempts to optimize the performance of `std::generate` for segmented iterators. Below are the benchmark numbers from `libcxx\test\benchmarks\algorithms\modifying\generate.bench.cpp`. Test cases that use segmented iterators have also been added.
- before
```
std::generate(deque<int>)/32          194 ns       193 ns   3733333
std::generate(deque<int>)/50          276 ns       276 ns   2488889
std::generate(deque<int>)/1024       5096 ns      5022 ns    112000
std::generate(deque<int>)/8192      40806 ns     40806 ns     17231
```
- after
```
std::generate(deque<int>)/32          106 ns       105 ns   6400000
std::generate(deque<int>)/50          139 ns       138 ns   4977778
std::generate(deque<int>)/1024       2713 ns      2699 ns    248889
std::generate(deque<int>)/8192      18983 ns     19252 ns     37333
```
---------
Co-authored-by: A. Jiang <[email protected]>
…nteger registers (llvm#163646)

Fix the `RegisterValue::SetValueFromData` method so that it works also for 128-bit registers that contain integers. Without this change, the `RegisterValue::SetValueFromData` method does not work correctly for 128-bit registers that contain (signed or unsigned) integers.

---

Steps to reproduce the problem:

(1) Create a program that writes a 128-bit number to the 128-bit register `xmm0`. E.g.:
```
#include <stdint.h>

int main() {
  __asm__ volatile (
    "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
    "pinsrq $1, %[hi], %%xmm0"      // insert high 64 bits
    :
    : [lo]"r"(0x7766554433221100),
      [hi]"r"(0xffeeddccbbaa9988)
  );
  return 0;
}
```
(2) Compile this program with the LLVM compiler:
```
$ $YOUR/clang -g -o main main.c
```
(3) Modify LLDB so that when it reads the value from the `xmm0` register, instead of assuming that it is a vector register, it treats it as if it contained an integer. This can be achieved e.g. this way:
```
diff --git a/lldb/source/Utility/RegisterValue.cpp b/lldb/source/Utility/RegisterValue.cpp
index 0e99451..a4b51db3e56d 100644
--- a/lldb/source/Utility/RegisterValue.cpp
+++ b/lldb/source/Utility/RegisterValue.cpp
@@ -188,6 +188,7 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     break;
   case eEncodingUint:
   case eEncodingSint:
+  case eEncodingVector:
     if (reg_info.byte_size == 1)
       SetUInt8(src.GetMaxU32(&src_offset, src_len));
     else if (reg_info.byte_size <= 2)
@@ -217,23 +218,6 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     else if (reg_info.byte_size == sizeof(long double))
       SetLongDouble(src.GetLongDouble(&src_offset));
     break;
-  case eEncodingVector: {
-    m_type = eTypeBytes;
-    assert(reg_info.byte_size <= kMaxRegisterByteSize);
-    buffer.bytes.resize(reg_info.byte_size);
-    buffer.byte_order = src.GetByteOrder();
-    if (src.CopyByteOrderedData(
-            src_offset,          // offset within "src" to start extracting data
-            src_len,             // src length
-            buffer.bytes.data(), // dst buffer
-            buffer.bytes.size(), // dst length
-            buffer.byte_order) == 0) // dst byte order
-    {
-      error = Status::FromErrorStringWithFormat(
-          "failed to copy data for register write of %s", reg_info.name);
-      return error;
-    }
-  }
   }
   if (m_type == eTypeInvalid)
```
(4) Rebuild LLDB.
(5) Observe how LLDB prints the content of this register after it was initialized with the 128-bit value.
```
$YOUR/lldb --source ./main
(lldb) target create main
Current executable set to '.../main' (x86_64).
(lldb) breakpoint set --file main.c --line 11
Breakpoint 1: where = main`main + 45 at main.c:11:3, address = 0x000000000000164d
(lldb) settings set stop-line-count-before 20
(lldb) process launch
Process 2568735 launched: '.../main' (x86_64)
Process 2568735 stopped
* thread #1, name = 'main', stop reason = breakpoint 1.1
    frame #0: 0x000055555555564d main`main at main.c:11:3
   1    #include <stdint.h>
   2
   3    int main() {
   4      __asm__ volatile (
   5        "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
   6        "pinsrq $1, %[hi], %%xmm0"      // insert high 64 bits
   7        :
   8        : [lo]"r"(0x7766554433221100),
   9          [hi]"r"(0xffeeddccbbaa9988)
  10      );
-> 11      return 0;
  12    }
(lldb) register read --format hex xmm0
    xmm0 = 0x7766554433221100ffeeddccbbaa9988
```
You can see that the upper and lower 64-bit wide halves are swapped.
---------
Co-authored-by: Matej Košík <[email protected]>
This patch pivots GPR32 and GPR64 zeroing into distinct branches to simplify the code and improve the lowering. Zeroing GPR moves are now handled differently than non-zeroing ones. Zero source registers WZR and XZR do not require register annotations of undef, implicit and kill. Since the non-zeroing path can no longer see WZR, the ternary expression is removed. This patch also moves the GPR64 logic right after GPR32 for better organization.
…m#164071) Add documentation for the no-rollback conversion driver. Also improve the documentation of the old rollback driver. In particular: which modifications are performed immediately and which are delayed.
Handle ptrtoaddr the same way as ptrtoint. The fold already only operates on the index/address bits.
If the main instruction among the copyables is a div-like instruction, the compiler cannot pack duplicates by extending with poison: once vectorized, these instructions would result in undefined behavior. Fixes llvm#164185
`UnqualPtrTy` didn't always match `llvm::PointerType::getUnqual`: sometimes it returned a pointer that is not in address space 0 (notably for SPIRV). Since `UnqualPtrTy` was used as the "generic" or "default" pointer type, this patch renames it to `DefaultPtrTy` to avoid confusion with LLVM's `PointerType::getUnqual`.
All the existing tests test code either in ConstantFolding or InstSimplify, so move them to use -passes=instsimplify instead of -passes=instcombine. This makes sure we keep InstSimplify coverage even if there are subsuming InstCombine folds. This requires writing some of the constant folding tests in a different way, as InstSimplify does not try to re-fold already existing constant expressions.
This reverts commit 1943c9e. This took out quite a few buildbots. Some of the Z3 test cases are failing and enabling this is causing some LLVM tests to begin failing.
Add parsing and semantic checks for DEVICE_SAFESYNC clause. No lowering.
This PR fixes a crash in the `bf_getbuffer` implementation of `PyDenseElementsAttribute` that occurred when an element type was not supported, such as `bf16`. I believe that supporting `bf16` is not possible with that protocol, but that's out of the scope of this PR. Previously, the code raised a `std::exception` out of `bf_getbuffer`, which nanobind does not catch (see also pybind/pybind11#3336). The PR makes the function catch all `std::exception`s and manually raise a Python exception instead. Signed-off-by: Ingo Müller <[email protected]>
Add test with urem guard with non-constant divisor and AddRec guards. Extra test coverage for llvm#163021
OpenACC 3.4 includes the ability to add an 'if' clause to an atomic operation. From the change log: `Added the if clause to the atomic construct to enable conditional atomic operations based on the parallelism strategy employed`. In 2.12, the C/C++ grammar is changed to say: `#pragma acc atomic [ atomic-clause ] [ if( condition ) ] new-line`, with corresponding changes to the Fortran standard. This patch adds support for this in the dialect, so that Clang can use it soon.
…es (llvm#163972) The lowering of `!$acc loop` loops with an early exit currently ends up "duplicating" the control flow in the acc.loop and inside it as explicit control flow (as if each iteration executes each iteration until the early exit). Add a TODO for now.
…lvm#164371) Was browsing through this, and the do/while loop (in addition to the local `ret` counter, which only ever gets incremented at most once) was hard to reason about imo. This patch removes both in favour of early returns.
The `ClangDeclVendor` used to contain more Clang-specific code than it does nowadays. But at this point, all it does is wrap the `DeclVendor::FindDecls` call and copy the resulting decls into `std::vector<clang::NamedDecl*>`. I.e., it converts the generic `CompilerDecl`s to `clang::NamedDecl*`s. In my opinion at this point it doesn't do enough to justify making it part of the `DeclVendor` hierarchy. This patch removes the `ClangDeclVendor` and instead does the conversion at callsite.
…onals (llvm#164342) If a select instruction is replaced with one whose conditional is the negation of the original, then the replacement's branch weights are the reverse of the original's. Tracking issue: llvm#147390
Fix MSAN failure and expensive test failure.
Implement MIR2Vec embedder for generating vector representations of Machine IR instructions, basic blocks, and functions. This patch introduces changes necessary to *embed* machine opcodes. Machine operands would be handled incrementally in the upcoming patches.
Reverts llvm#164321. Align behavior with other CUDA compilers.
) Based on the double precision sin/cos fast path algorithm:

Step 1: Perform range reduction `y = x mod pi/8` with target errors < 2^-54. This is because the worst case mod pi/8 for single precision is ~2^-31, so to have up to 1 ULP errors from the range reduction, the targeted errors should be `2^(-31 - 23) = 2^-54`.

Step 2: Polynomial approximation. We use degree-5 and degree-4 polynomials to approximate sin and cos of the reduced angle respectively.

Step 3: Combine the results using trig identities:
```math
\begin{align*} \sin(x) &= \sin(y) \cdot \cos(k \cdot \frac{\pi}{8}) + \cos(y) \cdot \sin(k \cdot \frac{\pi}{8}) \\ \cos(x) &= \cos(y) \cdot \cos(k \cdot \frac{\pi}{8}) - \sin(y) \cdot \sin(k \cdot \frac{\pi}{8}) \end{align*}
```

Overall errors: <= 3 ULPs for default rounding modes (tested exhaustively).

Current limitation: large range reduction requires FMA instructions for binary32. This restriction will be removed in the followup PR.
---------
Co-authored-by: Petr Hosek <[email protected]>
…lvm#164455) Add attributes to the unit tests required to pass `spirv-val`. Addresses llvm#161852
) Create a POSIX `<nl_types.h>` header with `catopen`, `catclose`, and `catgets` function declarations. Provide the stub/placeholder implementations which always return error. This is consistent with the way locales are currently (un-)implemented in llvm-libc. Notably, providing `<nl_types.h>` fixes the last remaining issue with building libc++ against llvm-libc (on certain configuration of x86_64 Linux) after disabling threads and wide-characters in libc++.
Upstream the basic support for the C++ try-catch statement with a try block that doesn't contain any call instructions and with a catch-all handler. Issue llvm#154992
With llvm#163862, this is not really necessary and causes downstream issues.
llvm#162332) Originally llvm#161912, we've now decided that an explicit GPL notification is redundant with the LICENSE file, which is a common convention for relaying this information. Co-authored-by: Cameron McInally <[email protected]>
…lvm#164346) llvm#140443 makes use of the CMake variable `Python3_EXECUTABLE_DEBUG`, which was introduced in CMake version 3.30. On systems with an older CMake version, the lit tests will try to run with an empty `config.python_executable`. This PR adds a warning and falls back to using `Python3_EXECUTABLE` if the CMake version is less than `3.30`.
Fixes test failure issues (caused by llvm#162161) in Windows buildbots.
This introduces the support for 32-bit ARM Fuchsia target which uses the aapcs-linux ABI defaulting to thumbv8a as the target.
Implementation files using the Intel syntax explicitly specify it. Do the same for the few files using AT&T syntax. This also enables building LLVM with `-mllvm -x86-asm-syntax=intel` in one's Clang config files (i.e. a global preference for Intel syntax). No functional change intended.
…IL target (llvm#164472) This is a temporary measure to explicitly remove the unrecognized named metadata when targeting DXIL. This should be replaced with an allowlist, as tracked in llvm#164473.
This commit introduces a base-class implementation for a method that
reads memory from multiple ranges at once. This implementation simply
calls the underlying `ReadMemoryFromInferior` method on each requested
range, intentionally bypassing the memory caching mechanism (though this
may be easily changed in the future).
`Process` implementations that can perform this operation more
efficiently - e.g. with the MultiMemPacket described in [1] - are
expected to override this method.
As an example, this commit changes AppleObjCClassDescriptorV2 to use the
new API.
Note about the API
------------------
In the RFC, we discussed having the API return some kind of class
`ReadMemoryRangesResult`. However, while writing such a class, it became
clear that it was merely wrapping a vector, without providing anything
useful. For example, this class:
```
struct ReadMemoryRangesResult {
  ReadMemoryRangesResult(
      llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges)
      : ranges(std::move(ranges)) {}

  llvm::ArrayRef<llvm::MutableArrayRef<uint8_t>> getRanges() const {
    return ranges;
  }

private:
  llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges;
};
```
As can be seen in the added test and in the added use-case
(AppleObjCClassDescriptorV2), users of this API will just iterate over
the vector of memory buffers. So they want a return type that can be
iterated over, and the vector seems more natural than creating a new
class and defining iterators for it.
Likewise, in the RFC, we discussed wrapping the result into an
`Expected`. Upon experimenting with the code, this feels like it limits
what the API is able to do, as the base class implementation never needs
to fail the entire result; it's the individual reads that may fail, and
this is expressed through a zero-length result. Any derived classes
overriding `ReadMemoryRanges` should also never produce a top-level
failure: if they did, they could just fall back to the base class
implementation, which would produce a better result.
The choice of having the caller allocate a buffer and pass it to
`Process::ReadMemoryRanges` is done mostly to follow conventions already
done in the Process class.
[1]:
https://discourse.llvm.org/t/rfc-a-new-vectorized-memory-read-packet/
…NFC) Split off to clarify naming, as suggested in llvm#156262.
PR to (hopefully) help Ron merge the upstream PR into downstream with minimal issues
PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/6
ranapratap55 approved these changes on Oct 23, 2025