
[CIR][CIRGen][builtin][X86] Implement support for _mm_getcsr() and _mm_setcsr() intrinsic #1681


Open · wants to merge 2,553 commits into main
Conversation

@Arthur-Chang016 (Contributor) commented Jun 13, 2025

AmrDeveloper and others added 30 commits April 9, 2025 13:53
This is the second part of
[PR#1290](llvm#1290), adding initial
support for `__cxa_rethrow`.

The last PR did not support statements that come after the
`UnreachableOp` from the `rethrow`; we just ignored them, e.g.:
```
struct S {
  S() {}
};

void foo() {
  int r = 1;
  try {
    throw;
    S s;
  } catch (...) {
    ++r;
  }
}
```

This PR fixes this. 

A few changes: 
- `rethrow` statements split their block into multiple blocks.
- Tests with statements that come after the `rethrow` were added and old
ones were modified.

Concern:
- The `ScopeOp` workaround still exists, which I believe was one of the
main concerns raised on the last PR.

As usual, I am open to discussion on this and how to approach it better :-)
This patch introduces `CIR_TBAAStructAttr`, which encodes the type and
offset of each field in a struct, although it may lead to some
duplication in `CIR`. If we manage `cir::TBAAStructAttr` by adding a new
method to `ASTRecordDeclInterface`, it will also introduce duplication
between `CIRGen` and LLVM lowering.
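
As a rough illustration, the attribute records one (type, offset) pair per struct field (hypothetical offsets for a typical x86-64 layout; the actual attribute syntax differs):

```cpp
// Illustration only: the information a TBAA struct attribute encodes.
struct Point {
  int x;    // (int,    offset 0)
  int y;    // (int,    offset 4)
  double z; // (double, offset 8)
};
```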
…vm#1370)

Currently, trying to generate CIR for the following code snippet
`yield.cpp` fails. Using `bin/clang++ yield.cpp -Xclang -fclangir
-Xclang -emit-cir -S -o -`:
``` 
struct S {
  S() {};
  int a;
};

void foo() {
  try {
    S s;
  } catch (...) {
    foo();
  }
}
```

The error: 
```
loc("yield.cpp":6:6): error: 'cir.yield' op must be the last operation in the parent block
```
There is an extra YieldOp! The CIR dump before verification looks
something like:
```
"cir.scope"() ({
  %0 = "cir.alloca"() <{alignment = 4 : i64, allocaType = !cir.struct<struct "S" {!cir.int<s, 32>} #cir.record.decl.ast>, ast = #cir.var.decl.ast, init, name = "s"}> : () -> !cir.ptr<!cir.struct<struct "S" {!cir.int<s, 32>} #cir.record.decl.ast>>
  "cir.try"() <{catch_types = [#cir.all]}> ({
    "cir.call"(%0) <{callee = @_ZN1SC1Ev, calling_conv = 1 : i32, exception, extra_attrs = #cir<extra({})>, side_effect = 1 : i32}> ({
      "cir.yield"() : () -> ()
    }) : (!cir.ptr<!cir.struct<struct "S" {!cir.int<s, 32>} #cir.record.decl.ast>>) -> ()
    "cir.yield"() : () -> ()
  }, {
    %1 = "cir.catch_param"() : () -> !cir.ptr<!cir.void>
    "cir.call"() <{ast = #cir.call.expr.ast, callee = @_Z3foov, calling_conv = 1 : i32, exception, extra_attrs = #cir<extra({})>, side_effect = 1 : i32}> ({
      "cir.yield"() : () -> ()
      "cir.yield"() : () -> () <--- **DUPLICATE**
    }) : () -> ()
    "cir.yield"() : () -> ()
  }) : () -> ()
  "cir.yield"() : () -> ()
}, {
}) : () -> ()
```
This PR adds a check for an already existing terminator before creating
a YieldOp during the cleanup.
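
A minimal sketch of the check, assuming MLIR's usual block/builder API (illustrative, not the actual ClangIR code):

```cpp
// During cleanup emission, only create the yield if the block does not
// already end in a terminator; this avoids the duplicate cir.yield above.
mlir::Block *block = builder.getInsertionBlock();
if (!block->mightHaveTerminator())
  builder.create<cir::YieldOp>(loc);
```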
There were problems with the pointer type and element type of the
Address class getting out of sync. In the traditional codegen the
pointer has no type, so it was sufficient for the Address class to
simply track the type it was supposed to be pointing to. Since ClangIR
pointer values are typed, the Address::withType function wasn't really
doing what it was supposed to. It returned an object with the same
pointer that the original object had, but with a mismatched element
type.

This change updates the Address::withType function to perform a bitcast
to get the expected pointer type before creating a new Address object.
It also adds assertions in the Address class to verify that pointer type
and element type are consistent and updates many places that were
causing those assertions to fire.

These code changes cause extra bitcasts to be emitted in a few places.
Regression tests have been updated as needed to reflect the CIR that is
now generated.
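
A minimal sketch of the idea behind the `withType` change (hypothetical helper names and signatures):

```cpp
// Sketch: keep the pointer type and element type in sync by bitcasting
// before constructing the new Address. Names are illustrative.
Address Address::withType(cir::CIRBaseBuilderTy &builder,
                          mlir::Type elemTy) const {
  auto ptrTy = cir::PointerType::get(builder.getContext(), elemTy);
  mlir::Value cast = builder.createBitcast(getPointer(), ptrTy);
  // The Address constructor now asserts that cast's pointee type == elemTy.
  return Address(cast, elemTy, getAlignment());
}
```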
Added a skeleton of NVPTX target lowering info.

This enables lowering of `simple.cu` (as it barely exercises device-side
functionality), so an LLVM IR test is also added for it.
This is a preparation of generating registration functions in
LoweringPrepare.

CUDA compilation works as follows (irrelevant arguments omitted):
```sh
# First compile for device, generating PTX assembly
clang++ test.cu -fcuda-is-device -o device.s

# Convert that into a binary file
ptxas device.s --output-file device.o
fatbin --create device.fatbin --image=profile=sm_52,file=device.o

# Pass that file as an argument to host
clang++ test.cu -fcuda-include-gpubinary device.fatbin -cuid="some unique id"
```
From the name of the GPU binary, we can obtain a handle for
registration, so we add an attribute to ModuleOp recording that name.

If that `-fcuda-include-gpubinary` is not specified (like in the test
`simple.cu`), OG will not generate any registration function. We do the
same here by not generating the attribute.
This PR adds support for LLVM lowering of pointers to member functions.
The lowering is ABI-specific and this patch only considers Itanium ABI.

Itanium ABI lowers pointers to member functions to a struct with two
fields of type `ptrdiff_t`. To extract fields from such aggregate
values, this PR includes a new operation `cir.extract_member` to
accomplish this.
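
For reference, the Itanium ABI representation this lowering targets looks roughly like this (a sketch; see the Itanium C++ ABI for the precise encoding):

```cpp
#include <cstddef>

// Itanium ABI: a pointer to member function is a pair of ptrdiff_t.
struct MemberFnPtr {
  ptrdiff_t ptr; // non-virtual: the function address;
                 // virtual: vtable offset with the low bit set
  ptrdiff_t adj; // adjustment to apply to the 'this' pointer
};
// cir.extract_member pulls the individual fields out of such a value.
```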
This PR adds CIRGen and LLVM lowering support for the
`__builtin_bitreverse` family of builtin functions.
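
For example:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // __builtin_bitreverse{8,16,32,64} reverse the bit order of the operand.
  uint8_t a = __builtin_bitreverse8(0x01);    // 0x80
  uint32_t b = __builtin_bitreverse32(0x01u); // 0x80000000
  std::printf("%#x %#x\n", a, b);
  return 0;
}
```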
This deals with some x86 aggregate types for CallConvLowering pass.

Suppose we have a simple struct like this.
```cpp
struct dim3 { int x, y, z; };
```
It can be coerced into
```cpp
struct dim3_ { uint64_t xy; int z; };
```
And for a function that receives it as an argument, OG does the
following transformation for x86:
```cpp
void f(dim3 arg) { /* Before */ }
void f(uint64_t xy, int z) { /* After */ }
```
Now this transformation is implemented in the CallConvLowering pass of
CIR.
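
Written out at the source level, the effect on a little-endian x86-64 target looks roughly like this (a hand-written sketch; the pass rewrites CIR, not C++):

```cpp
#include <cstdint>

struct dim3 { int x, y, z; };
void f_coerced(uint64_t xy, int z); // the post-coercion signature

// What a call to f(arg) effectively becomes:
void call(dim3 arg) {
  uint64_t xy = (uint64_t)(uint32_t)arg.x |
                ((uint64_t)(uint32_t)arg.y << 32); // x in the low 32 bits
  f_coerced(xy, arg.z);
}
```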
I checked
https://github.com/llvm/clangir/blob/main/clang/test/CIR/CodeGen/globals.cpp
and confirmed the code works as expected, although the test results needed
to be adjusted a bit.

Resolves: llvm#1252
This PR adds support for returns inside of a TryOp, for example: 

```
void foo() {
  int r = 1;
  try {
    return;
    ++r;
  } catch (...) {
  }
}
```
Currently, it fails during the CodeGen with: 
```
error: 'cir.return' op expects parent op to be one of 'cir.func, cir.scope, cir.if, cir.switch, cir.do, cir.while, cir.for, cir.case'
```
Was `TryOp` omitted from this list on purpose?
Change the assembly format for `cir::FuncType` from
```
!cir.func<!returnType (!argType)>
```
to
```
!cir.func<(!argType) -> !returnType>
```

This change (1) is easier to parse because it doesn't require lookahead,
(2) is consistent with the format of ClangIR `FuncOp` assembly, and (3)
is consistent with function types in other MLIR dialects.

Change all the tests to use or to expect the new format for function
types.

The contents and the semantics of `cir::FuncType` are unchanged. Only
the assembly format is being changed. Functions that return `void` in C
or C++ are still represented in MLIR as having no return type.

Most of the changes are in `parseFuncType` and `printFuncType` and the
functions they call in `CIRTypes.cpp`.

A `FuncType::verify` function was added to check that an explicit return
type is not `cir::VoidType`.
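
A sketch of what such a verifier looks like (illustrative signature and diagnostic text, not the exact ClangIR code):

```cpp
// Reject an explicit !cir.void return type; functions returning void are
// represented as having no return type instead.
mlir::LogicalResult cir::FuncType::verify(
    llvm::function_ref<mlir::InFlightDiagnostic()> emitError,
    llvm::ArrayRef<mlir::Type> argTypes, mlir::Type returnType,
    bool isVarArg) {
  if (returnType && mlir::isa<cir::VoidType>(returnType))
    return emitError() << "!cir.func must not have an explicit void return type";
  return mlir::success();
}
```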

`FuncType::isVoid()` was renamed to `FuncType::hasVoidReturn()`

Some comments for `FuncType` were improved.

An `llvm_unreachable` was added to `StructType::getKindAsStr` to
suppress a compiler warning and to catch a memory error that corrupts
the `RecordKind` field. (This was a drive-by fix and has nothing to do
with the rest of this change.)
Commit the .clang-tidy files for ClangIR. The biggest difference between
these and the Clang files is the capitalization of variable names. Most
ClangIR code follows the MLIR conventions instead of the Clang
conventions.
llvm#1390)

The CIRGen support is already there. This PR adds LLVM lowering support
for comparisons between pointers to member functions. Note that pointers
to member functions can only be compared for equality.
Run clang-tidy on `CIRGenFunction.cpp` and fix all the issues that were
identified. The vast majority of issues were the case of parameter and
variable names. Some of the issues that didn't have to do with names had
to be resolved manually.

There should be no behavior changes.
To give LoweringPrepare type information from `CIRGenTypeCache`, this PR
adds two attributes to ModuleOp:
```mlir
module attributes {
  cir.int_size = #cir.int_size<32>,
  cir.size_type_size = #cir.size_type_size<64>,
  ...
} {}
```
The `CIRDataLayout` class is also extended to have `getPtrDiffTy` and so
on.

Some tests that only expect `cir.lang` and `cir.sob` are also changed
to take this into account.
felixdaas and others added 23 commits May 27, 2025 17:33
The current implementation incorrectly uses `mlir::IntegerAttr::get`
with the unsupported type `cir::IntType`, which is not compatible and
was never tested. As discussed in PR llvm#1645, this is to be marked as NYI
until a proper implementation is provided.
Backport the calculation of maskbits in the lowering from `N - 1` to
`NextPowerOf2(numElements - 1) - 1`, similar to Clang CG.

Backport from
[#141411](llvm/llvm-project#141411)
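
Concretely, using `llvm::NextPowerOf2` from `llvm/Support/MathExtras.h` (a small sketch to show the difference):

```cpp
#include "llvm/Support/MathExtras.h"
#include <cstdint>

uint64_t maskBits(uint64_t numElements) {
  // old: numElements - 1
  // new, matching Clang CodeGen: next power of two above (n - 1), minus 1
  return llvm::NextPowerOf2(numElements - 1) - 1;
}
// numElements = 4: NextPowerOf2(3) = 4, mask = 3 (unchanged)
// numElements = 3: NextPowerOf2(2) = 4, mask = 3 (the old code gave 2)
```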
…lvm#1651)

We used to insert a continue block at the end of a flattened ternary op
that contained only a branch to the remaining operations of the remaining
block. This patch removes that continue block and makes the true/false
blocks jump directly to the remaining ops.

With this patch the CIR now generates exactly the same LLVM IR as the
original codegen.
This PR fixes [Issue#1647](llvm#1647).

It just takes the implementation from
[`emitRethrow`](https://github.com/llvm/clangir/blob/105d898b9898d224f0baca4b161a84bdcf817617/clang/lib/CIR/CodeGen/CIRGenItaniumCXXABI.cpp#L2273C1-L2276C77)
and extends the same logic to `emitThrow`.

The only nitpick about the fix is the same as before: we have this
[redundant
ScopeOp](https://github.com/llvm/clangir/blob/105d898b9898d224f0baca4b161a84bdcf817617/clang/lib/CIR/CodeGen/CIRGenItaniumCXXABI.cpp#L2298C1-L2303C37)
which acts as a placeholder, so there are some redundant yield blocks in
some cases. Aside from that, I believe this fix is okay for now.

I have added the tests from the issue to confirm everything works as
intended.

cc: @mmha, @bcardosolopes.
The LLVM dialect now has a `ptrmask` intrinsic; use it instead of the
manual computation.

Also fix the bitwidth of the generated mask in ABIInfoImpl.
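
For reference, `llvm.ptrmask(p, mask)` computes the same address as this manual sequence while preserving pointer provenance (a sketch of the semantics, not the dialect syntax):

```cpp
#include <cstdint>

// Manual equivalent: clear the address bits that are zero in the mask.
void *ptrmask_manual(void *p, uintptr_t mask) {
  return (void *)((uintptr_t)p & mask);
}
```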
Backport support for global initialization of ComplexType from
llvm/llvm-project#141369.
…lvm#1666)

This adds the missing printing of `dso_local` to FuncOp.
The attribute `dsolocal` was renamed in both `FuncOp` and `GlobalOp` to
align with LLVM naming.
…llvm#1660)

Fixes llvm#1405 

## Improving eraseIfSafe:
As far as I understand it, eraseIfSafe should intuitively check whether
all memref load/store ops that obtain offsets from the
memref.reinterpret_cast ops in the eraseList have been created. If so,
the operations in the eraseList are erased; otherwise they are kept until
all cir.load/store ops relying on them are lowered.

One challenge here is that we can't actually do this by checking the
uses of the memref.reinterpret_cast operations, as their results aren't
actually used in the created memref load/store ops (the base alloca
result found via findBaseAndIndices is used instead). Because of this,
the base alloca result is passed as the newAddr Value to eraseIfSafe in the
[cir.load](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CIR/Lowering/ThroughMLIR/LowerCIRToMLIR.cpp#L236C5-L242C6)/[cir.store](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CIR/Lowering/ThroughMLIR/LowerCIRToMLIR.cpp#L266C1-L271C6)
lowerings.

Currently the eraseIfSafe function counts all memref.load/store values
that use this base address:

https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CIR/Lowering/ThroughMLIR/LowerCIRToMLIR.cpp#L215-L218

The problem here is that this also counts all the other
memref.load/store ops that store/load to/from the base address but don't
use the memref.reinterpret_cast ops to obtain the offsets. Because of
this, the lowering fails if multiple stores/loads to/from the same array
are performed in the original C code, as in the example of issue
llvm#1405. Because we are erroneously counting all the other
memref.load/store ops, the newUsedNum is (for the later stores) larger
than oldUsedNum (only the uses of the cir.ptr_stride op), and the
memref.reinterpret_cast ops are not removed.

This PR contains a first attempt to fix this (i.e. only count the
memref.load/store ops that obtain offsets from the
memref.reinterpret_cast ops in the eraseList). I only count
memref.load/store ops if the first offset value corresponds to the
offset value in the last memref.reinterpret_cast.

Limitations of this PR:
This fixes the indirect lowering of the example in issue llvm#1405 and
also works for other tests I made where multiple stores/loads to/from the
same array occur, but it assumes two things to be the case:

1. The cir.const used as the stride in the cir.ptr_stride is not reused in
other cir.ptr_stride ops
2. Only the last cir.ptr_stride can have multiple uses (for multidim
arrays)

Both of these assumptions seem to hold for the C code I tested (for the
translation of accesses to C/C++ arrays to cir ops). But the eraseIfSafe
function might need to be changed/further improved in the future to
support cases where those assumptions fail.

For example, if an optimization is run on CIR where cir.const ops with
the same value are reused for different cir.ptr_stride ops, the indirect
lowering would still fail. Or if in a multidimensional array a subarray
is accessed, e.g.
```c
int arr[3][4];
int *row = arr[1];
```
(Note: I'm pretty sure it isn't sufficient to just extend the function
to check whether all offset values correspond to the offset values in all
the memref.reinterpret_cast ops; we would probably need to check
separately for each memref.reinterpret_cast whether it can be removed,
instead of removing all or none in the eraseList.)

## Improving the lowering of canonical ForOps:

While debugging issue llvm#1405 I noticed a few things that I think could
be improved in the canonical ForOp lowering:

1. There is one edge case where the forOp should not be marked as
canonical, in my opinion:
```c
  int i;
  for (i = 0; i < 100; i++);
  i += 10;
```
(With the current lowering this for is marked canonical, but since i is
replaced by the induction variable of the scf.for op and the actual
memory representing i is not updated, i has a wrong value after the for.
This is avoided when we lower this for as a non-canonical for.)
2. I think we can directly replace the loads of the CIR IV with the
scf IV and not create the dummy arith.add IV, 0 op. (I think this might
be a relic from a previous MLIR version where replaceOp only worked
with operations, not values.) This makes the IR more readable and easier
to understand. If I'm missing something here and the arith.add IV, 0 has
a purpose I'm not seeing, let me know.
3. When implementing the change in 1, we know that in a canonical for
the induction variable is defined inside the for and is only valid
there. Because of this, and since we replace the loads of the CIR IV with
the scf IV, we can remove the unnecessary alloca and store ops created
for the CIR IV.

(These changes only show up in a non-optimized binary and aren't
relevant when running clang with optimizations, but I still think they
improve the readability and understandability of the core IR.)

## Always running the SCFPreparePass when lowering through MLIR:

I also noticed that we currently only run the SCFPreparePass when
printing the result of the CIR-to-core-dialect translation.

https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CIR/CodeGen/CIRPasses.cpp#L84-L85
Because of this, compiling to an object file (or LLVM IR) with the
indirect lowering path fails if the code contains a canonical for. I
suggest always running this pass when we're going through MLIR.

## Passing through is_nontemporal in load/store lowerings:
Since the corresponding memref ops also have this attribute, it's
basically just a matter of passing through a boolean (and doesn't need
any special handling, I think). Even though there is probably no
practical application now, I think this might avoid bugs/confusion in the
future. If there is any reason not to do this, let me know.


I also added a new test case for arrays, adjusted the canonical forOp
test to reflect the changes made, combined the non-canonical forOp tests
into one file, and added a test case for the edge case described before.

(Note: if I find the time, I will try to run the SingleSource test suite
with the throughMLIR lowering next week to get a better idea of where we
are with this pipeline. In general I agree with everything discussed in
issue llvm#1219, but I think we can probably already add more support for
arrays (and maybe pointers) with the existing MLIR core constructs.)
)

This PR introduces
[`TryMarkNoThrow`](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CodeGen/CodeGenFunction.cpp#L1394).
[`isInterposable`](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/lib/CodeGen/CodeGenFunction.cpp#L1397C10-L1397C26)
isn't fully implemented, and I'm not quite sure we need it. Anyway, I
have introduced a missing feature, `getSemanticInterposition`, relevant
to its completion.

I have also updated an old test --
[`foo()`](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/clang/test/CIR/CodeGen/try-catch-dtors.cpp#L313)
should be marked as unwind/nothrow. I have compared with the original
CodeGen and attached the llvm output for verification.

One concern I have is whether the cases I mimic from
[`mayThrow`](https://github.com/llvm/clangir/blob/6e5fa09550c98f84d017873ed3e5667fd5fd909c/llvm/lib/IR/Instruction.cpp#L1158)
in the OG are enough; please let me know your thoughts.
This PR adds support for the `-fdump-record-layouts` flag. It enables
printing both the `CIRGenRecordLayout` and the `ASTRecordLayout`,
similar to what is done in CodeGen.
Backport support for Complex value initialization from the empty
InitList.

Backported from llvm/llvm-project#143192
)

Currently we can't handle continues nested under `IfOp`, because if we
replace one with a yield, it only breaks out of that `if`-statement
rather than continuing the whole loop. Perhaps that should be done by
changing the whole structure of the while loop.

Co-authored-by: Yue Huang <[email protected]>
…llvm#1670)

Backport the VecShuffleOp verifier to catch invalid index

Implemented in llvm/llvm-project#143262
…llvm#1673)

When processing a completed enum type, we checked whether the completed
type was in the type cache and cleared the cache if the completed and
converted underlying type for the enum didn't pass an `isInteger(32)`
check. Unfortunately, that checks whether the type is the MLIR builtin
32-bit integer type, whereas it will always be a CIR integer type, so
the check always fails.

I don't believe there can ever be a case where the forward declared type
for the enum doesn't match the completed type, so we should never need
to clear the cache. This change replaces the previous check with an
assert that compares the actual completed type to the cached type.
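
A sketch of the before and after (illustrative spellings, not the exact code):

```cpp
// Before: isInteger(32) is true only for the MLIR builtin i32, never for
// cir::IntType, so the cache-clearing branch was always taken.
// if (!convertedType.isInteger(32))
//   typeCache.clear();

// After: assert the completed type matches what was cached.
assert(cachedType == convertedType &&
       "completing the enum changed its converted type");
```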
…vm#1672)

This removes unnecessary parens from the assembly format of
BaseClassAddrOp, DerivedClassAddrOp, BaseDataMemberOp,
DerivedDataMemberOp, BaseMethodOp, and DerivedMethodOp to bring them
into conformance with the CIR ASM Style Guide.

There is no functional change beyond the assembly format change.
@Arthur-Chang016 (Contributor, Author) commented:

`clang -S -emit-llvm` produces:

```
define dso_local void @test_mm_getcsr() #0 !dbg !43 {
  %1 = alloca i32, align 4
  call void @llvm.x86.sse.stmxcsr(ptr %1), !dbg !46
  %2 = load i32, ptr %1, align 4, !dbg !46
  ret void, !dbg !47
}

define dso_local void @set_csr() #0 !dbg !48 {
  %1 = alloca i32, align 4
  store i32 0, ptr %1, align 4, !dbg !49
  call void @llvm.x86.sse.ldmxcsr(ptr %1), !dbg !49
  ret void, !dbg !50
}
```

However, the function signatures of `_mm_setcsr` and `_mm_getcsr` are a
bit different from LLVM's x86 intrinsics, so extra handling is needed to
be compatible. I tried to produce the same signatures in CIR as in LLVM,
which will eventually generate the same code as the direct C-to-LLVM
path.

```c
void _mm_setcsr(unsigned int);
unsigned int _mm_getcsr(void);
```
@Arthur-Chang016 (Contributor, Author) commented:

Thanks for the review!

```cpp
// note that _mm_getcsr() returns uint, but llvm.x86.sse.stmxcsr takes i32
// pointer and returns void. So needs alloc extra memory to store the
// result.
auto loc = getLoc(E->getExprLoc());
```
A Member commented:

It would be better to match OG here, any reason not to use createMemTemp?
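
For reference, the OG pattern being suggested looks roughly like this (a sketch from memory, not the verbatim CodeGen code):

```cpp
// CreateMemTemp allocates a properly aligned temporary stack slot; the
// stmxcsr intrinsic writes MXCSR through the pointer, then we load it back.
Address Tmp = CreateMemTemp(E->getType(), "mxcsr");
Builder.CreateCall(CGM.getIntrinsic(llvm::Intrinsic::x86_sse_stmxcsr),
                   Tmp.getPointer());
llvm::Value *Result = Builder.CreateLoad(Tmp);
```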

```cpp
}
case X86::BI__builtin_ia32_ldmxcsr:
case X86::BI_mm_setcsr: {
  auto loc = getLoc(E->getExprLoc());
```
A Member commented:

Similar question for this one!
