Conversation

astroC86
Contributor

@astroC86 astroC86 commented Aug 20, 2025

Motivation

Closes #98
Implements copy. I keep get and put as wrappers around the copy function so that the existing tests pass.

Technical Details

Test Plan

Test Result

(.iris_dev) root@2-6-0-gpu-mi300x1-192gb-devcloud-atl1:~/iris# mpirun -np 2 pytest ./tests/unittests/test_get.py 
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
rootdir: /root/iris
configfile: pyproject.toml
plugins: mpi-0.6
collecting ... rootdir: /root/iris
configfile: pyproject.toml
plugins: mpi-0.6
collected 16 items                                                                                                                                                                                              

collected 16 items                                                                                                                                                                                              

tests/unittests/test_get.py ................................                                                                                                                                                              [100%]                                                                                                                                                              [100%]



============================================================================================== 16 passed in 13.63s ==============================================================================================
============================================================================================== 16 passed in 13.63s ==============================================================================================
(.iris_dev) root@2-6-0-gpu-mi300x1-192gb-devcloud-atl1:~/iris# mpirun -np 2 pytest ./tests/unittests/test_put.py 
============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
rootdir: /root/iris
configfile: pyproject.toml
plugins: mpi-0.6
collecting ... ============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
rootdir: /root/iris
configfile: pyproject.toml
plugins: mpi-0.6
collected 16 items                                                                                                                                                                                              
collected 16 items                                                                                                                                                                                              

tests/unittests/test_put.py 
tests/unittests/test_put.py ................................                                                                                                                                                              [100%]                                                                                                                                                              [100%]

============================================================================================== 16 passed in 13.52s ==============================================================================================


============================================================================================== 16 passed in 13.51s ==============================================================================================

Submission Checklist

@astroC86 astroC86 force-pushed the astroC86/get-or-put-to-copy branch from 894410f to c2ca89c Compare August 20, 2025 20:38
@astroC86 astroC86 force-pushed the astroC86/get-or-put-to-copy branch from 940e3e9 to 3853f82 Compare August 20, 2025 20:39
@mawad-amd mawad-amd added core Core Iris library development iris Iris project issue labels Aug 21, 2025
@astroC86 astroC86 requested a review from mawad-amd August 22, 2025 12:31
@neoblizz
Member

This PR will have to update the examples that use put or get; there should be a few.

Collaborator

@mawad-amd mawad-amd left a comment

Yeah, there are a few other places where we use iris.put. But I think something about the semantics of copy is not correct.

  # load to local register, accumulate.
  for target_rank in range(num_ranks):
-     iris.get(data + offsets, results + offsets, cur_rank, target_rank, heap_bases, mask=mask)
+     iris.copy(data + offsets, results + offsets, cur_rank, target_rank, heap_bases, mask=mask)
Collaborator

Shouldn't this be:

Suggested change
- iris.copy(data + offsets, results + offsets, cur_rank, target_rank, heap_bases, mask=mask)
+ iris.copy(data + offsets, results + offsets, target_rank, cur_rank, heap_bases, mask=mask)

Collaborator

The code I suggested fails the test, btw, but it shouldn't according to the docstring.

Contributor Author

@astroC86 astroC86 Aug 22, 2025

I assume here that from_rank is always the current rank. If we want to allow the two ranks to be interchanged, then perhaps this would be more appropriate? Let me know what you think.

import triton
import triton.language as tl


@triton.jit
def copy(src_ptr, dst_ptr, from_rank, to_rank, cur_rank, heap_bases, mask=None):
    # The calling rank must be either the source or the destination.
    assert cur_rank == from_rank or cur_rank == to_rank, "Cannot copy between two arbitrary ranks"

    cur_base  = tl.load(heap_bases + cur_rank)

    from_base = tl.load(heap_bases + from_rank)
    to_base   = tl.load(heap_bases + to_rank)

    # Byte offsets of both pointers relative to the current rank's heap base.
    src_ptr_int = tl.cast(src_ptr, tl.uint64)
    src_offset  = src_ptr_int - cur_base

    dst_ptr_int = tl.cast(dst_ptr, tl.uint64)
    dst_offset  = dst_ptr_int - cur_base

    from_base_byte = tl.cast(from_base, tl.pointer_type(tl.int8))
    to_base_byte   = tl.cast(to_base  , tl.pointer_type(tl.int8))

    # Translate each pointer into the heap of the rank that owns that side.
    translated_src = tl.cast(from_base_byte + src_offset, src_ptr.dtype)
    translated_dst = tl.cast(to_base_byte   + dst_offset, dst_ptr.dtype)

    data = tl.load(translated_src, mask=mask)
    tl.store(translated_dst, data, mask=mask)
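
To make the intended semantics easier to discuss, here is a minimal host-side sketch in plain Python. It is only a model, not Iris code: each rank's symmetric heap is a Python list, a "pointer" is just an offset into it, and the names `heaps`, `pulled`, and `pushed` are made up for illustration. It assumes the signature proposed above, where `cur_rank` must be either `from_rank` or `to_rank`.

```python
# Host-side model only. All names here are hypothetical and are not part of
# the Iris API; a "pointer" is modelled as an offset into a rank's heap.

def copy(heaps, src_off, dst_off, from_rank, to_rank, cur_rank, mask=None):
    """Copy elements from from_rank's heap into to_rank's heap."""
    if cur_rank != from_rank and cur_rank != to_rank:
        raise ValueError("Cannot copy between two arbitrary ranks")
    if mask is None:
        mask = [True] * len(src_off)
    for s, d, m in zip(src_off, dst_off, mask):
        if m:  # masked-off lanes neither load nor store
            heaps[to_rank][d] = heaps[from_rank][s]


heaps = [[10, 11, 12, 13], [20, 21, 22, 23]]  # heap of rank 0, heap of rank 1
offsets = [0, 1, 2, 3]

# Rank 0 pulls from rank 1 (get-like): the source is remote, the destination local.
copy(heaps, offsets, offsets, from_rank=1, to_rank=0, cur_rank=0,
     mask=[True, True, False, False])
pulled = list(heaps[0])

# Rank 0 pushes to rank 1 (put-like): the source is local, the destination remote.
copy(heaps, [2, 3], [2, 3], from_rank=0, to_rank=1, cur_rank=0)
pushed = list(heaps[1])
```

In this model the direction of the transfer is decided entirely by `from_rank` and `to_rank`, so swapping them turns a get-like call into a put-like one.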

Collaborator

The solution you proposed here is good, but it adds the overhead of two translations. I have been thinking about this and I am not sure there is a way to resolve it cleanly.
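
For anyone following along, the cost in question can be illustrated with plain integer arithmetic. This is only a sketch under assumptions: the base addresses are invented, and `translate` is a hypothetical helper, not an Iris function. It shows that translating the side owned by the current rank just reproduces the original pointer, so one of the two translations does no useful work.

```python
# Illustrative arithmetic only; the base addresses below are made up.
heap_bases = [0x1000, 0x8000]  # hypothetical heap base address per rank


def translate(ptr, cur_rank, rank):
    """Map a pointer in cur_rank's heap to the same offset in rank's heap."""
    offset = ptr - heap_bases[cur_rank]
    return heap_bases[rank] + offset


cur_rank, remote_rank = 0, 1
ptr = heap_bases[cur_rank] + 0x40

local_side = translate(ptr, cur_rank, cur_rank)      # round-trips to ptr
remote_side = translate(ptr, cur_rank, remote_rank)  # lands in rank 1's heap
```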

I don’t really like the put/get names but maybe we will just stick to them for now. Let’s keep this PR open for now and we can come back to it later if we get better ideas. Thanks for your time looking into this and sorry this feature was not very well thought through.

Contributor Author

Hi, no worries at all! Thanks a lot for taking the time to review my solution!

@neoblizz
Member

@mawad-amd @astroC86 can we come back to this? Consider merging it while retaining put/get.

@astroC86
Contributor Author

@neoblizz sure

@astroC86 astroC86 force-pushed the astroC86/get-or-put-to-copy branch from 9b6f23f to 877df70 Compare September 13, 2025 02:54
@astroC86
Contributor Author

astroC86 commented Sep 13, 2025

@neoblizz @mawad-amd I can add more tests, but let me know what you think of the current implementation (I think it is a little controversial but safer).

iris/util.py Outdated
Comment on lines 147 to 164
@triton.jit
def trap_if(cond):
    drv = tl.zeros([1], dtype=tl.uint32)
    cond_u32 = tl.where(cond, drv, drv + 1)
    if tl.program_id(0) == 0:
        tl.inline_asm_elementwise(
            asm="""
            s_cmp_lg_u32 $1, 0
            s_cbranch_scc1 0f
            s_trap 2
            0:
            """,
            constraints="=v,s",
            args=[cond_u32],
            dtype=tl.uint32,
            is_pure=False,
            pack=1,
        )
Member

Not sure why we need this? I think inline_asm will also make it difficult for the compiler to do optimization as this will show up as a black box (I think).

Contributor Author

@astroC86 astroC86 Sep 13, 2025

I call trap_if((cur_rank != from_rank) and (cur_rank != to_rank)) inside the copy function. If copy is called where neither from_rank nor to_rank is cur_rank, that is, we tried to copy between two arbitrary ranks, which is something we don't support, we trap. (I implemented the last solution, which @mawad-amd thought was better than the one initially implemented.) Alternatively, we can remove the trap and say that this case is undefined behavior.
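
As a quick illustration of which calls that condition rejects, here is a small plain-Python sketch. The helper name `should_trap` and the three-rank setup are assumptions made for the example; only the boolean expression itself comes from the comment above.

```python
def should_trap(cur_rank, from_rank, to_rank):
    # Same predicate as the one passed to trap_if: the calling rank is
    # neither the source nor the destination of the copy.
    return (cur_rank != from_rank) and (cur_rank != to_rank)


num_ranks = 3  # hypothetical world size
cur_rank = 0

# Every (from_rank, to_rank) pair that rank 0 would not be allowed to issue.
trapping = [
    (f, t)
    for f in range(num_ranks)
    for t in range(num_ranks)
    if should_trap(cur_rank, f, t)
]
```

With three ranks and `cur_rank = 0`, the rejected pairs are exactly those where both ranks are remote; any pair that includes rank 0 on either side is accepted.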

@astroC86 astroC86 force-pushed the astroC86/get-or-put-to-copy branch from 96b6001 to fcc41e8 Compare September 13, 2025 11:31
Labels
core Core Iris library development iris Iris project issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Update put and get to copy
3 participants