@cdevadas

No description provided.

Pierre-vh and others added 30 commits October 20, 2025 12:03
…vm#161637)

The check used was not strong enough to prevent the insertion of sample/bvhcnt when they were not needed.
I assume SIInsertWaitCnts was trimming those away anyway, but this was a bug nonetheless.

We were inserting SAMPLE/BVHcnt waits in places where we only needed to wait on the previous atomic operation. Neither of these counters has any atomics associated with it.
When the legacy cost model scalarizes loads that are used as addresses
for other loads and stores, it looks to phi nodes, if they are direct
address operands of loads/stores. Match this behavior in
isUsedByLoadStoreAddress, to fix a divergence between legacy and
VPlan-based cost model.
No tests seem to need them here or downstream.
…lvm#163570)

Splats include two poison values, but only the poison-ness of the
splatted value actually matters.
A single-scalar replicate without side-effects, and with uniform
operands, is uniform. Special-case assumes and stores.
…4051)

There are cases where atomic constraints are independent of template
parameters, yet we still have a template parameter mapping.

We don't bother translating template arguments for them. Note that we
retain an empty parameter mapping rather than none at all, as the former
may improve cache hit rates (We don't profile MLTAL but profile the
empty template argument list instead.)

This is a regression on trunk, so there's no release note.
When serializing SPIR-V MLIR containing an externally linked function with
debug enabled, the serialization crashes as `printValueIDMap` tries to
print a key value that has already been destroyed. This happens because,
for an externally linked function, the body of the function is erased, which
causes its arguments to be destroyed as well, but the valueIDMap was never
updated.
…vm#164198)

createWidenCast doesn't set the flag type, so when we simplify trunc
(zext nneg x) -> zext x we would hit an assertion in CSE that the flag
types don't match with other VPWidenCastRecipes that weren't simplified.

This fixes it the same way trunc flags are handled too.

As an aside I think it should be correct to preserve the nneg flag in
this case since the input operand is still non-negative after the
transform. But that's left to another PR.

Fixes llvm#164171
Recipes in replicate regions implicitly depend on the region's
predicate. Limit CSE to recipes in the same block, when either recipe is
in a replicate region.

This allows handling VPPredInstPHIRecipe during CSE. If we perform CSE
on recipes inside a replicate region, we may end up with 2
VPPredInstPHIRecipes sharing the same operand. This is incompatible with
current VPPredInstPHIRecipe codegen, which re-sets the current value of
its operand in VPTransformState. This can cause crashes in the added
test cases.

Note that this patch only modifies ::isEqual to check for replicating
regions and not getHash, as CSE across replicating regions should be
uncommon.
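
The choice to tighten only `::isEqual` and leave `getHash` alone can be illustrated with a small hash-set sketch. Everything here is hypothetical (the `Recipe` struct, its fields, and `is_redundant` are not VPlan classes), and it assumes all recipes in one block share the same replicate-region flag. The point is only that equal keys must hash equally, while unequal keys are free to collide.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical recipe model; not VPlan's real classes.
// Assumption: every recipe in a given block has the same in_replicate_region flag.
struct Recipe {
  int opcode;
  int operand;
  int block_id;
  bool in_replicate_region;
};

// The hash ignores the block, so recipes from different blocks may collide.
struct RecipeHash {
  std::size_t operator()(const Recipe &r) const {
    return std::hash<int>()(r.opcode) * 31u + std::hash<int>()(r.operand);
  }
};

// Equality additionally requires the same block when either recipe sits in a
// replicate region. Equal recipes still hash equally, so lookups stay correct;
// the only cost is a few extra collisions.
struct RecipeEqual {
  bool operator()(const Recipe &a, const Recipe &b) const {
    if (a.opcode != b.opcode || a.operand != b.operand)
      return false;
    if (a.in_replicate_region || b.in_replicate_region)
      return a.block_id == b.block_id;
    return true;
  }
};

using CSESet = std::unordered_set<Recipe, RecipeHash, RecipeEqual>;

// Returns true if `r` duplicates a recipe that was already seen.
inline bool is_redundant(CSESet &seen, const Recipe &r) {
  return !seen.insert(r).second;
}
```

With this setup, two identical recipes in different ordinary blocks are still merged, while two identical recipes in different replicate-region blocks are kept apart.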

Fixes llvm#157314. 
Fixes llvm#161974.

PR: llvm#162110
…64204)

One iteration of this loop might've already fixed up the pointers of
coming globals, so check for that explicitly.

Fixes llvm#164151
- typos
- use insert instead of back inserter to make allocations more efficient
- make sure the constraint cache uses canonicalized keys to avoid
redundant work
I was a bit too eager to remove the SI_WHOLE_WAVE_FUNC_SETUP instruction
during prolog emission. Erasing it invalidates MBBI, which in some cases
is still needed outside of `emitCSRSpillStores`.

Do the erasing at the end of prolog insertion instead.
Part of llvm#102817.

This patch attempts to optimize the performance of `std::generate` for
segmented iterators. Below are the benchmark numbers from
`libcxx\test\benchmarks\algorithms\modifying\generate.bench.cpp`. Test
cases that use segmented iterators have also been added.

- before

```
std::generate(deque<int>)/32           194 ns          193 ns      3733333
std::generate(deque<int>)/50           276 ns          276 ns      2488889
std::generate(deque<int>)/1024        5096 ns         5022 ns       112000
std::generate(deque<int>)/8192       40806 ns        40806 ns        17231
```

- after

```
std::generate(deque<int>)/32           106 ns          105 ns      6400000
std::generate(deque<int>)/50           139 ns          138 ns      4977778
std::generate(deque<int>)/1024        2713 ns         2699 ns       248889
std::generate(deque<int>)/8192       18983 ns        19252 ns        37333
```
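
As a rough illustration of why segment-aware iteration can help, here is a minimal sketch. It is not the libc++ implementation: `SegmentedInts`, `segmented_generate`, and `make_segmented` are hypothetical names, and the container is just a vector of contiguous blocks standing in for a deque's internal layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical segmented container: a sequence of contiguous blocks,
// loosely modeled on how a deque stores its elements.
struct SegmentedInts {
  std::vector<std::vector<int>> blocks;
};

// Fill every element by calling gen() once per element, in order.
// The outer loop walks segments; the inner loop is a plain pointer loop over
// contiguous storage, with no per-element segment-boundary check.
template <class Generator>
void segmented_generate(SegmentedInts &s, Generator gen) {
  for (std::vector<int> &block : s.blocks) {
    int *first = block.data();
    int *last = first + block.size();
    for (; first != last; ++first)
      *first = gen();
  }
}

// Helper for illustration: n zero-initialized elements in blocks of block_size.
inline SegmentedInts make_segmented(std::size_t n, std::size_t block_size) {
  assert(block_size > 0);
  SegmentedInts s;
  for (std::size_t done = 0; done < n; done += block_size) {
    std::size_t len = (n - done < block_size) ? (n - done) : block_size;
    s.blocks.emplace_back(len, 0);
  }
  return s;
}
```

For example, `segmented_generate(s, [&next] { return next++; })` numbers the elements consecutively across all blocks.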

---------

Co-authored-by: A. Jiang <[email protected]>
…nteger registers (llvm#163646)

Fix the `RegisterValue::SetValueFromData` method so that it works also
for 128-bit registers that contain integers.

Without this change, the `RegisterValue::SetValueFromData` method does
not work correctly
for 128-bit registers that contain (signed or unsigned) integers.

---

Steps to reproduce the problem:

(1)

Create a program that writes a 128-bit number to the 128-bit register
`xmm0`. E.g.:
```
#include <stdint.h>

int main() {
  __asm__ volatile (
      "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
      "pinsrq $1, %[hi], %%xmm0"    // insert high 64 bits
      :
      : [lo]"r"(0x7766554433221100),
        [hi]"r"(0xffeeddccbbaa9988)
  );
  return 0;
}
```

(2)

Compile this program with the LLVM compiler:
```
$ $YOUR/clang -g -o main main.c
```

(3)

Modify LLDB so that, when it reads the value of the `xmm0`
register, it treats the register as if it contained an integer instead of
assuming that it is a vector register. This can be achieved e.g. this way:
```
diff --git a/lldb/source/Utility/RegisterValue.cpp b/lldb/source/Utility/RegisterValue.cpp
index 0e99451..a4b51db3e56d 100644
--- a/lldb/source/Utility/RegisterValue.cpp
+++ b/lldb/source/Utility/RegisterValue.cpp
@@ -188,6 +188,7 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     break;
   case eEncodingUint:
   case eEncodingSint:
+  case eEncodingVector:
     if (reg_info.byte_size == 1)
       SetUInt8(src.GetMaxU32(&src_offset, src_len));
     else if (reg_info.byte_size <= 2)
@@ -217,23 +218,6 @@ Status RegisterValue::SetValueFromData(const RegisterInfo &reg_info,
     else if (reg_info.byte_size == sizeof(long double))
       SetLongDouble(src.GetLongDouble(&src_offset));
     break;
-  case eEncodingVector: {
-    m_type = eTypeBytes;
-    assert(reg_info.byte_size <= kMaxRegisterByteSize);
-    buffer.bytes.resize(reg_info.byte_size);
-    buffer.byte_order = src.GetByteOrder();
-    if (src.CopyByteOrderedData(
-            src_offset,          // offset within "src" to start extracting data
-            src_len,             // src length
-            buffer.bytes.data(), // dst buffer
-            buffer.bytes.size(), // dst length
-            buffer.byte_order) == 0) // dst byte order
-    {
-      error = Status::FromErrorStringWithFormat(
-          "failed to copy data for register write of %s", reg_info.name);
-      return error;
-    }
-  }
   }
 
   if (m_type == eTypeInvalid)
```

(4)

Rebuild LLDB.

(5)

Observe how LLDB prints the content of this register
after it has been initialized with a 128-bit value.
```
$YOUR/lldb --source ./main
(lldb) target create main
Current executable set to '.../main' (x86_64).
(lldb) breakpoint set --file main.c --line 11
Breakpoint 1: where = main`main + 45 at main.c:11:3, address = 0x000000000000164d
(lldb) settings set stop-line-count-before 20
(lldb) process launch
Process 2568735 launched: '.../main' (x86_64)
Process 2568735 stopped
* thread #1, name = 'main', stop reason = breakpoint 1.1
    frame #0: 0x000055555555564d main`main at main.c:11:3
   1   	#include <stdint.h>
   2   	
   3   	int main() {
   4   	  __asm__ volatile (
   5   	      "pinsrq $0, %[lo], %%xmm0\n\t"  // insert low 64 bits
   6   	      "pinsrq $1, %[hi], %%xmm0"    // insert high 64 bits
   7   	      :
   8   	      : [lo]"r"(0x7766554433221100),
   9   	        [hi]"r"(0xffeeddccbbaa9988)
   10  	  );
-> 11  	  return 0;
   12  	}
(lldb) register read --format hex xmm0
    xmm0 = 0x7766554433221100ffeeddccbbaa9988
```

You can see that the upper and lower 64-bit wide halves are swapped.
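
For reference, the expected ordering can be checked with a small standalone sketch. It is not LLDB code: `U128`, `read_le64`, `read_le128`, and `to_hex` are hypothetical helpers that interpret 16 little-endian bytes as one 128-bit integer and print the high half first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical 128-bit value held as two 64-bit halves.
struct U128 {
  std::uint64_t hi;
  std::uint64_t lo;
};

// Read 8 little-endian bytes into a 64-bit integer.
inline std::uint64_t read_le64(const std::uint8_t *p) {
  std::uint64_t v = 0;
  for (int i = 7; i >= 0; --i)
    v = (v << 8) | p[i];
  return v;
}

// Interpret 16 little-endian bytes as one 128-bit integer:
// bytes 0..7 form the low half, bytes 8..15 form the high half.
inline U128 read_le128(const std::uint8_t *bytes, std::size_t len) {
  assert(len >= 16);
  return U128{read_le64(bytes + 8), read_le64(bytes)};
}

// Format the most significant half first, as a hex view of one integer would.
inline std::string to_hex(const U128 &v) {
  char buf[35]; // "0x" + 32 hex digits + terminating NUL
  std::snprintf(buf, sizeof buf, "0x%016llx%016llx",
                static_cast<unsigned long long>(v.hi),
                static_cast<unsigned long long>(v.lo));
  return buf;
}
```

Feeding it the bytes `00 11 22 ... ff` (the memory image of the value written in step 1) yields `0xffeeddccbbaa99887766554433221100`, which is the output above with its two halves exchanged.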

---------

Co-authored-by: Matej Košík <[email protected]>
This patch pivots GPR32 and GPR64 zeroing into distinct branches to
simplify the code and improve the lowering.

Zeroing GPR moves are now handled differently than non-zeroing ones.
Zero source registers WZR and XZR do not require register annotations of
undef, implicit and kill. The non-zeroing path no longer needs to handle
WZR, which removes the ternary expression. This patch also moves the GPR64
logic right after GPR32 for better organization.
…m#164071)

Add documentation for the no-rollback conversion driver. Also improve
the documentation of the old rollback driver. In particular: which
modifications are performed immediately and which are delayed.
Handle ptrtoaddr the same way as ptrtoint. The fold already only
operates on the index/address bits.
If the main instruction in the copyables is a div-like instruction, the
compiler cannot pack duplicates by extending with poisons: once
vectorized, these instructions would result in undefined behavior.

Fixes llvm#164185
`UnqualPtrTy` didn't always match `llvm::PointerType::getUnqual`:
sometimes it returned a pointer that is not in address space 0 (notably
for SPIRV).

Since `UnqualPtrTy` was used as the "generic" or "default" pointer type,
this patch renames it to `DefaultPtrTy` to avoid confusion with LLVM's
`PointerType::getUnqual`.
All the existing tests test code either in ConstantFolding or
InstSimplify, so move them to use -passes=instsimplify instead of
-passes=instcombine. This makes sure we keep InstSimplify coverage
even if there are subsuming InstCombine folds.

This requires writing some of the constant folding tests in a
different way, as InstSimplify does not try to re-fold already
existing constant expressions.
This reverts commit 1943c9e.

This took out quite a few buildbots. Some of the Z3 test cases are failing
and enabling this is causing some LLVM tests to begin failing.
Add parsing and semantic checks for DEVICE_SAFESYNC clause. No lowering.
This PR fixes a crash in the `bf_getbuffer` implementation of
`PyDenseElementsAttribute` that occurred when an element type was not
supported, such as `bf16`. I believe that supporting `bf16` is not
possible with that protocol but that's out of the scope of this PR.
Previously, the code raised an `std::exception` out of `bf_getbuffer`
that nanobind does not catch (see also pybind/pybind11#3336). The PR
makes the function catch all `std::exception`s and manually raises a
Python exception instead.
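
The general pattern (catch everything at a C-style callback boundary and report the failure through the callee's error channel) can be sketched as follows. This does not use the CPython or nanobind APIs: `ErrorState`, `describe_element_type`, and `get_buffer_callback` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the error state a C API keeps (for example, a
// pending Python exception). Not a real CPython or nanobind interface.
struct ErrorState {
  bool set = false;
  std::string message;
};

// Hypothetical worker that throws for element types it cannot describe.
inline int describe_element_type(const std::string &type) {
  if (type == "f32")
    return 4; // item size in bytes
  throw std::invalid_argument("unsupported element type: " + type);
}

// C-style callback: returns 0 on success and -1 on failure.
// No std::exception escapes; failures are recorded in `err` instead.
inline int get_buffer_callback(const std::string &type, int *item_size,
                               ErrorState *err) {
  assert(item_size != nullptr && err != nullptr);
  try {
    *item_size = describe_element_type(type);
    return 0;
  } catch (const std::exception &e) {
    err->set = true;
    err->message = e.what();
    return -1;
  }
}
```

A caller then checks the return code and, on `-1`, reads the message from the error state rather than relying on C++ unwinding through foreign frames.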

Signed-off-by: Ingo Müller <[email protected]>
Add test with urem guard with non-constant divisor and AddRec guards.

Extra test coverage for llvm#163021
OpenACC 3.4 includes the ability to add an 'if' to an atomic operation.

From the change log:
`Added the if clause to the atomic construct to enable conditional
atomic operations based on the parallelism strategy employed`

In 2.12, the C/C++ grammar is changed to say: 
`#pragma acc atomic [ atomic-clause ] [ if( condition ) ] new-line`

With corresponding changes to the Fortran standard.

This patch adds support to this for the dialect, so that Clang can use
it soon.
…es (llvm#163972)

The lowering of `!$acc loop` loops with an early exit currently ends up
"duplicating" the control flow: once in the acc.loop and again inside it as
explicit control flow (as if each iteration executed every iteration until
the early exit).

Add a TODO for now.
Michael137 and others added 24 commits October 21, 2025 18:00
…lvm#164371)

Was browsing through this and the do/while loop (in addition to the
local `ret` counter which only ever gets incremented at most once) were
hard to reason about imo. This patch removes both in favour of
early-returns.
The `ClangDeclVendor` used to contain more Clang-specific code than it
does nowadays. But at this point, all it does is wrap the
`DeclVendor::FindDecls` call and copy the resulting decls into
`std::vector<clang::NamedDecl*>`. I.e., it converts the generic
`CompilerDecl`s to `clang::NamedDecl*`s.

In my opinion at this point it doesn't do enough to justify making it
part of the `DeclVendor` hierarchy.

This patch removes the `ClangDeclVendor` and instead does the conversion
at callsite.
…onals (llvm#164342)

If a select instruction is replaced with one whose conditional is the
negation of the original, then the replacement's branch weights are the
reverse of the original's.
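
A toy model of that rule, with hypothetical names (`SelectProfile`, `invert_condition`, and `evaluate` are not LLVM APIs): negating the condition and swapping the arms keeps the result, and the two weights trade places along with the arms.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical model of a select with profile weights; not LLVM's API.
struct SelectProfile {
  bool cond;
  int true_value;
  int false_value;
  std::uint32_t true_weight;  // how often the condition was observed true
  std::uint32_t false_weight; // how often the condition was observed false
};

// Rewrite `select c, a, b` as `select !c, b, a`.
// The selected value is unchanged, but the new condition is true exactly when
// the old one was false, so the weights are swapped as well.
inline SelectProfile invert_condition(SelectProfile s) {
  s.cond = !s.cond;
  std::swap(s.true_value, s.false_value);
  std::swap(s.true_weight, s.false_weight);
  return s;
}

inline int evaluate(const SelectProfile &s) {
  return s.cond ? s.true_value : s.false_value;
}
```

Forgetting the weight swap would leave the profile claiming the rarely true condition is the common one.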

Tracking issue: llvm#147390
Fix MSAN failure and expensive test failure.
Implement MIR2Vec embedder for generating vector representations of Machine IR instructions, basic blocks, and functions. This patch introduces changes necessary to *embed* machine opcodes. Machine operands would be handled incrementally in the upcoming patches.
)

Based on the double precision's sin/cos fast path algorithm:

Step 1: Perform range reduction `y = x mod pi/8` with target errors <
2^-54.
This is because the worst case mod pi/8 for single precision is ~2^-31,
so to have up to 1 ULP errors from
the range reduction, the targeted errors should be `2^(-31 - 23) =
2^-54`.

Step 2: Polynomial approximation
We use degree-5 and degree-4 polynomials to approximate sin and cos of
the reduced angle respectively.

Step 3: Combine the results using trig identities
```math
\begin{align*}
  \sin(x) &= \sin(y) \cdot \cos(k \cdot \frac{\pi}{8}) + \cos(y) \cdot \sin(k \cdot \frac{\pi}{8}) \\
  \cos(x) &= \cos(y) \cdot \cos(k \cdot \frac{\pi}{8}) - \sin(y) \cdot \sin(k \cdot \frac{\pi}{8})
\end{align*}
```

Overall errors: <= 3 ULPs for default rounding modes (tested
exhaustively).

Current limitation: large range reduction requires FMA instructions for
binary32. This restriction will be removed in the followup PR.
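
The three steps can be sketched in a few lines. This is only an illustration under loud assumptions: it uses plain Taylor coefficients rather than tuned polynomials, does the range reduction naively in double precision (so it is only reasonable for moderate `x`), and calls `std::sin`/`std::cos` where a real implementation would use a small lookup table. `SinCos` and `sincos_sketch` are hypothetical names, and the accuracy stated above does not apply to this sketch.

```cpp
#include <cassert>
#include <cmath>

struct SinCos {
  double s; // approximation of sin(x)
  double c; // approximation of cos(x)
};

// Sketch only: Taylor coefficients and naive double-precision reduction.
inline SinCos sincos_sketch(double x) {
  assert(std::isfinite(x));
  const double pi_over_8 = 3.14159265358979323846 / 8.0;

  // Step 1: write x = k * pi/8 + y with |y| <= pi/16.
  const double k = std::round(x / pi_over_8);
  const double y = x - k * pi_over_8;

  // Step 2: degree-5 polynomial for sin(y), degree-4 polynomial for cos(y).
  const double y2 = y * y;
  const double sin_y = y * (1.0 + y2 * (-1.0 / 6.0 + y2 / 120.0));
  const double cos_y = 1.0 + y2 * (-0.5 + y2 / 24.0);

  // Step 3: angle-addition identities. The values for k * pi/8 stand in for
  // what would normally be a 16-entry table indexed by k mod 16.
  const double sin_k = std::sin(k * pi_over_8);
  const double cos_k = std::cos(k * pi_over_8);
  return SinCos{sin_y * cos_k + cos_y * sin_k,
                cos_y * cos_k - sin_y * sin_k};
}
```

Because `|y|` never exceeds pi/16, even these untuned polynomials stay within about 1e-7 of the library results for small inputs.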

---------

Co-authored-by: Petr Hosek <[email protected]>
…lvm#164455)

Add attributes to the unit tests required to pass `spirv-val`.


Addresses llvm#161852
)

Create a POSIX `<nl_types.h>` header with `catopen`, `catclose`, and
`catgets` function declarations.
Provide the stub/placeholder implementations which always return error.
This is consistent with the way
locales are currently (un-)implemented in llvm-libc.
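
A stub set with that "always fail" behavior might look like the sketch below. It is not the llvm-libc source: the functions live in a hypothetical `stub` namespace, `nl_catd` is modeled as `void *`, and the choice of `ENOENT` is an assumption. The return values follow POSIX failure conventions (`(nl_catd)-1` from `catopen`, `-1` from `catclose`, and the caller's default string from `catgets`).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

namespace stub {

// Hypothetical stand-in for nl_catd; the real type is implementation-defined.
using nl_catd = void *;

// Always fails: reports that no message catalog could be opened.
// ENOENT is an assumed error code for this sketch.
inline nl_catd catopen(const char * /*name*/, int /*flag*/) {
  errno = ENOENT;
  return reinterpret_cast<nl_catd>(static_cast<std::intptr_t>(-1));
}

// Always fails: there is never an open catalog to close.
inline int catclose(nl_catd /*catalog*/) { return -1; }

// Lookup always fails, so the caller's default string is returned.
inline char *catgets(nl_catd /*catalog*/, int /*set_id*/, int /*msg_id*/,
                     const char *default_msg) {
  return const_cast<char *>(default_msg);
}

} // namespace stub
```

Code written against the header still behaves sensibly with such stubs, since well-formed callers already fall back to the default message when the catalog is unavailable.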

Notably, providing `<nl_types.h>` fixes the last remaining issue with
building libc++ against llvm-libc
(on certain configuration of x86_64 Linux) after disabling threads and
wide-characters in libc++.
Upstream the basic support for the C++ try catch statement with a try
block that doesn't contain any call instructions and with a catch-all
statement.

Issue llvm#154992
With llvm#163862, this is not really necessary and causes downstream issues.
llvm#162332)

Originally llvm#161912, we've now decided that an explicit GPL notification
is redundant with the LICENSE file, which is a common convention for
relaying this information.

Co-authored-by: Cameron McInally <[email protected]>
…lvm#164346)

llvm#140443 makes use of the CMake
variable `Python3_EXECUTABLE_DEBUG`, which was introduced in CMake
version 3.30. On systems with an older version of CMake, the lit
tests will try to run with an empty `config.python_executable`.

This PR adds a warning and falls back to using `Python3_EXECUTABLE` if
the CMake version is less than `3.30`.
Fixes test failure issues (caused by llvm#162161) in Windows buildbots.
This introduces the support for 32-bit ARM Fuchsia target which uses the
aapcs-linux ABI defaulting to thumbv8a as the target.
Implementation files using the Intel syntax explicitly specify it.
Do the same for the few files using AT&T syntax.

This also enables building LLVM with `-mllvm -x86-asm-syntax=intel` in one's Clang config files
(i.e. a global preference for Intel syntax).

No functional change intended.
…IL target (llvm#164472)

This is a temporary measure to explicitly remove the unrecognized named
metadata when targeting DXIL.

This should be removed for an allowlist as tracked here:
llvm#164473.
This commit introduces a base-class implementation for a method that
reads memory from multiple ranges at once. This implementation simply
calls the underlying `ReadMemoryFromInferior` method on each requested
range, intentionally bypassing the memory caching mechanism (though this
may be easily changed in the future).

`Process` implementations that can perform this operation more
efficiently - e.g. with the MultiMemPacket described in [1] - are
expected to override this method.

As an example, this commit changes AppleObjCClassDescriptorV2 to use the
new API.

Note about the API
------------------

In the RFC, we discussed having the API return some kind of class
`ReadMemoryRangesResult`. However, while writing such a class, it became
clear that it was merely wrapping a vector, without providing anything
useful. For example, this class:

```
struct ReadMemoryRangesResult {
  ReadMemoryRangesResult(
      llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges)
      : ranges(std::move(ranges)) {}

  llvm::ArrayRef<llvm::MutableArrayRef<uint8_t>> getRanges() const {
    return ranges;
  }

private:
  llvm::SmallVector<llvm::MutableArrayRef<uint8_t>> ranges;
};
```

As can be seen in the added test and in the added use-case
(AppleObjCClassDescriptorV2), users of this API will just iterate over
the vector of memory buffers. So they want a return type that can be
iterated over, and the vector seems more natural than creating a new
class and defining iterators for it.

Likewise, in the RFC, we discussed wrapping the result into an
`Expected`. Upon experimenting with the code, this feels like it limits
what the API is able to do as the base class implementation never needs
to fail the entire result, it's the individual reads that may fail and
this is expressed through a zero-length result. Any derived classes
overriding `ReadMemoryRanges` should also never produce a top level
failure: if they did, they can just fall back to the base class
implementation, which would produce a better result.

The choice of having the caller allocate a buffer and pass it to
`Process::ReadMemoryRanges` is done mostly to follow conventions already
done in the Process class.
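
The shape of such an API can be sketched with a toy process. None of the names below are LLDB's (`FakeProcess`, `Range`, `BufferView`, `ReadOne`, and `ReadRanges` are hypothetical); the sketch only mirrors the design points discussed here: one single-range read per request, a caller-allocated buffer, a plain vector as the result, and a zero-length entry for a failed read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are LLDB's actual types or signatures.
struct Range {
  std::uint64_t addr;
  std::size_t size;
};

// A view into the caller-provided buffer; size == 0 marks a failed read.
struct BufferView {
  std::uint8_t *data;
  std::size_t size;
};

class FakeProcess {
public:
  explicit FakeProcess(std::vector<std::uint8_t> memory)
      : memory_(std::move(memory)) {}

  // Single-range read: returns the number of bytes read, or 0 on failure.
  std::size_t ReadOne(std::uint64_t addr, std::uint8_t *dst,
                      std::size_t size) const {
    if (size == 0 || addr > memory_.size() || size > memory_.size() - addr)
      return 0;
    std::memcpy(dst, memory_.data() + addr, size);
    return size;
  }

  // Multi-range read in the base-class style: one ReadOne call per range,
  // results packed back to back into `buffer`, which the caller allocated.
  std::vector<BufferView> ReadRanges(const std::vector<Range> &ranges,
                                     std::vector<std::uint8_t> &buffer) const {
    std::vector<BufferView> results;
    std::size_t offset = 0;
    for (const Range &r : ranges) {
      assert(r.size <= buffer.size() - offset && "caller buffer too small");
      std::uint8_t *dst = buffer.data() + offset;
      std::size_t got = ReadOne(r.addr, dst, r.size);
      results.push_back(BufferView{dst, got});
      offset += r.size;
    }
    return results;
  }

private:
  std::vector<std::uint8_t> memory_;
};
```

Callers simply iterate over the returned vector and skip entries whose size is zero, which is why a dedicated result class or a top-level `Expected` adds little.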



[1]:
https://discourse.llvm.org/t/rfc-a-new-vectorized-memory-read-packet/
…NFC)

Split off to clarify naming, as suggested in
llvm#156262.
PR to (hopefully) help Ron merge the upstream PR into downstream with
minimal issues
@skganesan008
Collaborator

@cdevadas cdevadas merged commit 90d1195 into amd-feature/wave-transform Oct 23, 2025
14 checks passed
@cdevadas cdevadas deleted the amd/wave-transform/merge-oct22 branch October 23, 2025 03:59