forked from llvm/llvm-project
merge amd-staging into amd-feature/wave-transform #348
Merged
cdevadas
merged 973 commits into
amd-feature/wave-transform
from
amd/wave-transform/merge-oct22
Oct 23, 2025
Conversation
…vm#161637) The check used was not strong enough to prevent the insertion of sample/bvhcnt waits when they were not needed. I assume SIInsertWaitCnts was trimming those away anyway, but this was a bug nonetheless. We were inserting SAMPLE/BVHcnt waits in places where we only needed to wait on the previous atomic operation. Neither of these counters has any atomics associated with them.
When the legacy cost model scalarizes loads that are used as addresses for other loads and stores, it also considers phi nodes, if they are direct address operands of loads/stores. Match this behavior in isUsedByLoadStoreAddress to fix a divergence between the legacy and VPlan-based cost models.
No tests seem to need them here or downstream.
…lvm#163570) Splats include two poison values, but only the poison-ness of the splatted value actually matters.
A single-scalar replicate without side-effects, and with uniform operands, is uniform. Special-case assumes and stores.
…4051) There are cases where atomic constraints are independent of template parameters, yet we still have a template parameter mapping. We don't bother translating template arguments for them. Note that we retain an empty parameter mapping rather than none at all, as the former may improve cache hit rates (We don't profile MLTAL but profile the empty template argument list instead.) This is a regression on trunk, so there's no release note.
When serializing SPIR-V MLIR containing an externally linked function with debug enabled, the serialization crashes as `printValueIDMap` tries to print a key value that has already been destroyed. This happens because, for externally linked functions, the body of the function is erased, which causes its arguments to be destroyed as well, while the valueIDMap was never updated.
…vm#164198) createWidenCast doesn't set the flag type, so when we simplify trunc (zext nneg x) -> zext x we would hit an assertion in CSE that the flag types don't match with other VPWidenCastRecipes that weren't simplified. This fixes it the same way trunc flags are handled too. As an aside I think it should be correct to preserve the nneg flag in this case since the input operand is still non-negative after the transform. But that's left to another PR. Fixes llvm#164171
Recipes in replicate regions implicitly depend on the region's predicate. Limit CSE to recipes in the same block, when either recipe is in a replicate region. This allows handling VPPredInstPHIRecipe during CSE. If we perform CSE on recipes inside a replicate region, we may end up with 2 VPPredInstPHIRecipes sharing the same operand. This is incompatible with current VPPredInstPHIRecipe codegen, which re-sets the current value of its operand in VPTransformState. This can cause crashes in the added test cases. Note that this patch only modifies ::isEqual to check for replicating regions and not getHash, as CSE across replicating regions should be uncommon. Fixes llvm#157314. Fixes llvm#161974. PR: llvm#162110
…64204) One iteration of this loop might have already fixed up the pointers of upcoming globals, so check for that explicitly. Fixes llvm#164151
- typos
- use insert instead of back inserter to make allocations more efficient
- make sure the constraint cache uses canonicalized keys to avoid redundant work
I was a bit too eager to remove the SI_WHOLE_WAVE_FUNC_SETUP instruction during prolog emission. Erasing it invalidates MBBI, which in some cases is still needed outside of `emitCSRSpillStores`. Do the erasing at the end of prolog insertion instead.
Part of llvm#102817. This patch attempts to optimize the performance of `std::generate` for segmented iterators. Below are the benchmark numbers from `libcxx\test\benchmarks\algorithms\modifying\generate.bench.cpp`. Test cases that use segmented iterators have also been added.
- before
```
std::generate(deque<int>)/32          194 ns       193 ns   3733333
std::generate(deque<int>)/50          276 ns       276 ns   2488889
std::generate(deque<int>)/1024       5096 ns      5022 ns    112000
std::generate(deque<int>)/8192      40806 ns     40806 ns     17231
```
- after
```
std::generate(deque<int>)/32          106 ns       105 ns   6400000
std::generate(deque<int>)/50          139 ns       138 ns   4977778
std::generate(deque<int>)/1024       2713 ns      2699 ns    248889
std::generate(deque<int>)/8192      18983 ns     19252 ns     37333
```
---------
Co-authored-by: A. Jiang <[email protected]>
…nteger registers (llvm#163646)

Fix the `RegisterValue::SetValueFromData` method so that it works also for 128-bit registers that contain integers. Without this change, the `RegisterValue::SetValueFromData` method does not work correctly for 128-bit registers that contain (signed or unsigned) integers.

---

Steps to reproduce the problem:

(1) Create a program that writes a 128-bit number to the 128-bit register `xmm0`. E.g.:
```
#include <stdint.h>

int main() {
  __asm__ volatile (
    "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
    "pinsrq $1, %[hi], %%xmm0"      // insert high 64 bits
    :
    : [lo]"r"(0x7766554433221100),
      [hi]"r"(0xffeeddccbbaa9988)
  );
  return 0;
}
```
(2) Compile this program with the LLVM compiler:
```
$ $YOUR/clang -g -o main main.c
```
(3) Modify LLDB so that when it reads the value from the `xmm0` register, instead of assuming that it is a vector register, it treats it as if it contained an integer. This can be achieved e.g. this way:
```
diff --git a/lldb/source/Utility/RegisterValue.cpp b/lldb/source/Utility/RegisterValue.cpp
index 0e99451..a4b51db3e56d 100644
--- a/lldb/source/Utility/RegisterValue.cpp
+++ b/lldb/source/Utility/RegisterValue.cpp
@@ -188,6 +188,7 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     break;
   case eEncodingUint:
   case eEncodingSint:
+  case eEncodingVector:
     if (reg_info.byte_size == 1)
       SetUInt8(src.GetMaxU32(&src_offset, src_len));
     else if (reg_info.byte_size <= 2)
@@ -217,23 +218,6 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     else if (reg_info.byte_size == sizeof(long double))
       SetLongDouble(src.GetLongDouble(&src_offset));
     break;
-  case eEncodingVector: {
-    m_type = eTypeBytes;
-    assert(reg_info.byte_size <= kMaxRegisterByteSize);
-    buffer.bytes.resize(reg_info.byte_size);
-    buffer.byte_order = src.GetByteOrder();
-    if (src.CopyByteOrderedData(
-            src_offset,          // offset within "src" to start extracting data
-            src_len,             // src length
-            buffer.bytes.data(), // dst buffer
-            buffer.bytes.size(), // dst length
-            buffer.byte_order) == 0) // dst byte order
-    {
-      error = Status::FromErrorStringWithFormat(
-          "failed to copy data for register write of %s", reg_info.name);
-      return error;
-    }
-  }
   }
   if (m_type == eTypeInvalid)
```
(4) Rebuild LLDB.
(5) Observe how LLDB prints the content of this register after it was initialized with the 128-bit value.
```
$YOUR/lldb --source ./main
(lldb) target create main
Current executable set to '.../main' (x86_64).
(lldb) breakpoint set --file main.c --line 11
Breakpoint 1: where = main`main + 45 at main.c:11:3, address = 0x000000000000164d
(lldb) settings set stop-line-count-before 20
(lldb) process launch
Process 2568735 launched: '.../main' (x86_64)
Process 2568735 stopped
* thread #1, name = 'main', stop reason = breakpoint 1.1
    frame #0: 0x000055555555564d main`main at main.c:11:3
   1    #include <stdint.h>
   2
   3    int main() {
   4      __asm__ volatile (
   5        "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
   6        "pinsrq $1, %[hi], %%xmm0"      // insert high 64 bits
   7        :
   8        : [lo]"r"(0x7766554433221100),
   9          [hi]"r"(0xffeeddccbbaa9988)
  10      );
-> 11      return 0;
  12    }
(lldb) register read --format hex xmm0
    xmm0 = 0x7766554433221100ffeeddccbbaa9988
```
You can see that the upper and lower 64-bit wide halves are swapped.
---------
Co-authored-by: Matej Košík <[email protected]>
This patch pivots GPR32 and GPR64 zeroing into distinct branches to simplify the code and improve the lowering. Zeroing GPR moves are now handled differently than non-zeroing ones. Zero source registers WZR and XZR do not require register annotations of undef, implicit and kill. Since the non-zeroing path can no longer see WZR, the ternary expression is removed. This patch also moves the GPR64 logic right after GPR32 for better organization.
…m#164071) Add documentation for the no-rollback conversion driver. Also improve the documentation of the old rollback driver. In particular: which modifications are performed immediately and which are delayed.
Handle ptrtoaddr the same way as ptrtoint. The fold already only operates on the index/address bits.
If the main instruction among the copyables is a div-like instruction, the compiler cannot pack duplicates by extending with poison: once vectorized, these instructions would result in undefined behavior. Fixes llvm#164185
`UnqualPtrTy` didn't always match `llvm::PointerType::getUnqual`: sometimes it returned a pointer that is not in address space 0 (notably for SPIRV). Since `UnqualPtrTy` was used as the "generic" or "default" pointer type, this patch renames it to `DefaultPtrTy` to avoid confusion with LLVM's `PointerType::getUnqual`.
All the existing tests test code either in ConstantFolding or InstSimplify, so move them to use -passes=instsimplify instead of -passes=instcombine. This makes sure we keep InstSimplify coverage even if there are subsuming InstCombine folds. This requires writing some of the constant folding tests in a different way, as InstSimplify does not try to re-fold already existing constant expressions.
This reverts commit 1943c9e. This took out quite a few buildbots. Some of the Z3 test cases are failing and enabling this is causing some LLVM tests to begin failing.
Add parsing and semantic checks for DEVICE_SAFESYNC clause. No lowering.
This PR fixes a crash in the `bf_getbuffer` implementation of `PyDenseElementsAttribute` that occurred when an element type was not supported, such as `bf16`. I believe that supporting `bf16` is not possible with that protocol, but that's out of the scope of this PR. Previously, the code raised a `std::exception` out of `bf_getbuffer`, which nanobind does not catch (see also pybind/pybind11#3336). The PR makes the function catch all `std::exception`s and manually raise a Python exception instead. Signed-off-by: Ingo Müller <[email protected]>
Add test with urem guard with non-constant divisor and AddRec guards. Extra test coverage for llvm#163021
OpenACC 3.4 includes the ability to add an 'if' clause to an atomic operation. From the change log: `Added the if clause to the atomic construct to enable conditional atomic operations based on the parallelism strategy employed`. In 2.12, the C/C++ grammar is changed to say: `#pragma acc atomic [ atomic-clause ] [ if( condition ) ] new-line`, with corresponding changes to the Fortran standard. This patch adds support for this in the dialect, so that Clang can use it soon.
…es (llvm#163972) The lowering of `!$acc loop` loops with an early exit currently ends up "duplicating" the control flow in the acc.loop and inside it as explicit control flow (as if each iteration executes each iteration until the early exit). Add a TODO for now.
…lvm#164371) Was browsing through this, and the do/while loop (in addition to the local `ret` counter, which only ever gets incremented at most once) was hard to reason about imo. This patch removes both in favour of early returns.
The `ClangDeclVendor` used to contain more Clang-specific code than it does nowadays. But at this point, all it does is wrap the `DeclVendor::FindDecls` call and copy the resulting decls into `std::vector<clang::NamedDecl*>`. I.e., it converts the generic `CompilerDecl`s to `clang::NamedDecl*`s. In my opinion at this point it doesn't do enough to justify making it part of the `DeclVendor` hierarchy. This patch removes the `ClangDeclVendor` and instead does the conversion at callsite.
…onals (llvm#164342) If a select instruction is replaced with one whose conditional is the negation of the original, then the replacement's branch weights are the reverse of the original's. Tracking issue: llvm#147390
Fix MSAN failure and expensive test failure.
Implement MIR2Vec embedder for generating vector representations of Machine IR instructions, basic blocks, and functions. This patch introduces changes necessary to *embed* machine opcodes. Machine operands would be handled incrementally in the upcoming patches.
Reverts llvm#164321. Align behavior with other CUDA compilers.
) Based on the double precision sin/cos fast path algorithm:

Step 1: Perform range reduction `y = x mod pi/8` with target errors < 2^-54. This is because the worst case mod pi/8 for single precision is ~2^-31, so to have up to 1 ULP errors from the range reduction, the targeted errors should be `2^(-31 - 23) = 2^-54`.

Step 2: Polynomial approximation. We use degree-5 and degree-4 polynomials to approximate sin and cos of the reduced angle respectively.

Step 3: Combine the results using trig identities:
```math
\begin{align*} \sin(x) &= \sin(y) \cdot \cos(k \cdot \frac{\pi}{8}) + \cos(y) \cdot \sin(k \cdot \frac{\pi}{8}) \\ \cos(x) &= \cos(y) \cdot \cos(k \cdot \frac{\pi}{8}) - \sin(y) \cdot \sin(k \cdot \frac{\pi}{8}) \end{align*}
```

Overall errors: <= 3 ULPs for default rounding modes (tested exhaustively).

Current limitation: large range reduction requires FMA instructions for binary32. This restriction will be removed in the followup PR.
---------
Co-authored-by: Petr Hosek <[email protected]>
…lvm#164455) Add attributes to the unit tests required to pass `spirv-val`. Addresses llvm#161852
) Create a POSIX `<nl_types.h>` header with `catopen`, `catclose`, and `catgets` function declarations. Provide the stub/placeholder implementations which always return error. This is consistent with the way locales are currently (un-)implemented in llvm-libc. Notably, providing `<nl_types.h>` fixes the last remaining issue with building libc++ against llvm-libc (on certain configuration of x86_64 Linux) after disabling threads and wide-characters in libc++.
Upstream the basic support for the C++ try-catch statement with a try block that doesn't contain any call instructions and with a catch-all handler. Issue llvm#154992
With llvm#163862, this is not really necessary and causes downstream issues.
llvm#162332) Originally llvm#161912, we've now decided that an explicit GPL notification is redundant with the LICENSE file, which is a common convention for relaying this information. Co-authored-by: Cameron McInally <[email protected]>
…lvm#164346) llvm#140443 makes use of the CMake variable `Python3_EXECUTABLE_DEBUG`, which was introduced in CMake version 3.30. On systems with an older CMake version, the lit tests will try to run with an empty `config.python_executable`. This PR adds a warning and falls back to using `Python3_EXECUTABLE` if the CMake version is less than `3.30`.
Fixes test failure issues (caused by llvm#162161) in Windows buildbots.
This introduces the support for 32-bit ARM Fuchsia target which uses the aapcs-linux ABI defaulting to thumbv8a as the target.
Implementation files using the Intel syntax explicitly specify it. Do the same for the few files using AT&T syntax. This also enables building LLVM with `-mllvm -x86-asm-syntax=intel` in one's Clang config files (i.e. a global preference for Intel syntax). No functional change intended.
…IL target (llvm#164472) This is a temporary measure to explicitly remove the unrecognized named metadata when targeting DXIL. This should be replaced with an allowlist, as tracked in llvm#164473.
This commit introduces a base-class implementation for a method that
reads memory from multiple ranges at once. This implementation simply
calls the underlying `ReadMemoryFromInferior` method on each requested
range, intentionally bypassing the memory caching mechanism (though this
may be easily changed in the future).
`Process` implementations that can perform this operation more
efficiently - e.g. with the MultiMemPacket described in [1] - are
expected to override this method.
As an example, this commit changes AppleObjCClassDescriptorV2 to use the
new API.
Note about the API
------------------
In the RFC, we discussed having the API return some kind of class
`ReadMemoryRangesResult`. However, while writing such a class, it became
clear that it was merely wrapping a vector, without providing anything
useful. For example, this class:
```
struct ReadMemoryRangesResult {
  ReadMemoryRangesResult(
      llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges)
      : ranges(std::move(ranges)) {}

  llvm::ArrayRef<llvm::MutableArrayRef<uint8_t>> getRanges() const {
    return ranges;
  }

private:
  llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges;
};
```
As can be seen in the added test and in the added use-case
(AppleObjCClassDescriptorV2), users of this API will just iterate over
the vector of memory buffers. So they want a return type that can be
iterated over, and the vector seems more natural than creating a new
class and defining iterators for it.
Likewise, in the RFC, we discussed wrapping the result into an
`Expected`. Upon experimenting with the code, this feels like it limits
what the API is able to do, as the base class implementation never needs
to fail the entire result; it's the individual reads that may fail, and
this is expressed through a zero-length result. Any derived classes
overriding `ReadMemoryRanges` should also never produce a top-level
failure: if they did, they could just fall back to the base class
implementation, which would produce a better result.
The choice of having the caller allocate a buffer and pass it to
`Process::ReadMemoryRanges` is done mostly to follow conventions already
done in the Process class.
[1]:
https://discourse.llvm.org/t/rfc-a-new-vectorized-memory-read-packet/
…NFC) Split off to clarify naming, as suggested in llvm#156262.
PR to (hopefully) help Ron merge the upstream PR into downstream with minimal issues
PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/6
ranapratap55 approved these changes on Oct 23, 2025