XeGPU RFC update: Add matrix_desc and operations for share local memory access #1092


Open
wants to merge 13 commits into base: main
Conversation

Jianhui-Li
Contributor

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@Jianhui-Li Jianhui-Li changed the title Add matrix_desc and operations XeGPU RFC update: Add matrix_desc and operations for share local memory Jul 8, 2025
@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory XeGPU RFC update: Add matrix_desc and operations for share local memory access Jul 8, 2025
@@ -329,6 +329,20 @@ Attribute `Memory_kind` describes the memory kind. "global" means the global mem

`nbarrier` and `fence` operations lower to uniform instructions, so there is no need to specify the `sg_map`.

## XeGPU operations to access shared local memory
Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can attach an attribute for its memory layout, for example, a blocked layout or the original non-blocked row-major layout (a.k.a. linear layout).
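
As an illustrative, non-normative sketch (the creation op name `xegpu.create_matrix_desc` and its `%slm` operand are assumptions for this sketch; only the `matrix_desc` type and the `@block` attribute syntax are taken from the examples in this RFC):

// Hypothetical creation op: carve a 32x256 f16 row-major matrix out of
// shared local memory, optionally attaching a blocked memory layout attribute.
%m  = xegpu.create_matrix_desc %slm : matrix_desc<32x256xf16>
%mb = xegpu.create_matrix_desc %slm : matrix_desc<32x256xf16, @block=[16, 16]>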
Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use `load_matrix` and `store_matrix` to move matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's `load_matrix` and `store_matrix` work at the workgroup level only; they use `xegpu.layout` to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
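
A minimal sketch of the workgroup-level flow described above, assuming `%m` is a `matrix_desc<32x256xf16>` in shared local memory and `#wg_layout` names an `xegpu.layout` attribute (the exact attribute spelling and result mapping are illustrative, not defined by this RFC text):

// Workgroup-level load: the entire matrix is read into a vector; #wg_layout
// describes how the data is decomposed into fragments and mapped to work items.
%v = xegpu.load_matrix %m {layout = #wg_layout}
  : matrix_desc<32x256xf16> -> vector<32x256xf16>

// ... workgroup-level computation on %v ...

// Workgroup-level store: write the vector back to shared local memory.
xegpu.store_matrix %v, %m {layout = #wg_layout}
  : vector<32x256xf16>, matrix_desc<32x256xf16>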
Contributor

Since we're talking about WG-level here, I think #1033 should be merged before this one


// Subview for DPAS tile shape
%ma = xegpu.matrix_desc_subview %m
: matrix_desc<32x256xf16> -> matrix_desc<256x32xf16, @block=[16, 16], #dpas_t_inst>
Contributor

A <256x32> subview of <32x256>? Is it a "transposed view"? If so, shouldn't there be strides for the subview result?
