Use cudaMemcpyAsync rather than kernel when possible #1088
Conversation
Greptile Overview
Greptile Summary
Optimizes tensor-to-tensor assignment operations by using cudaMemcpyAsync instead of launching a kernel when both tensors are contiguous, and fixes incorrect preprocessor macro usage in tensor prefetch methods.
Key changes:
- Fixed CUDA_VERSION to CUDART_VERSION in tensor.h:744 and tensor.h:768 for correct CUDA runtime version detection
- Added an optimization path in base_operator.h that detects contiguous tensor-to-tensor assignments and uses cudaMemcpyAsync instead of kernel execution (a minimal sketch follows this list)
- Includes aliased memory checking before the optimized copy
- Falls back to kernel execution for non-contiguous tensors
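To make the fast path concrete, here is a minimal sketch of the idea, not MatX's actual code: assign_fast_path and launch_copy_kernel are hypothetical names, while IsContiguous(), Data(), and Bytes() mirror the tensor-view accessors the review refers to.

```cpp
#include <cuda_runtime_api.h>

// Minimal sketch (hypothetical helper, not the actual base_operator.h code).
template <typename Lhs, typename Rhs>
void assign_fast_path(Lhs &lhs, const Rhs &rhs, cudaStream_t stream) {
  if (lhs.IsContiguous() && rhs.IsContiguous()) {
    // Dense views: one flat byte copy is equivalent to the element-wise
    // assignment kernel but skips the kernel launch entirely.
    // cudaMemcpyDefault lets the runtime infer the direction via UVA.
    cudaMemcpyAsync(lhs.Data(), rhs.Data(), rhs.Bytes(),
                    cudaMemcpyDefault, stream);
  } else {
    // Strided or sliced views need per-element index math, so fall back
    // to the regular copy kernel.
    launch_copy_kernel(lhs, rhs, stream); // hypothetical fallback
  }
}
```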
Issues found:
- Critical bug in base_operator.h:212: uses tp->get_lhs().Bytes() as the copy size, which will read past the end of RHS memory when LHS is larger than RHS. Should use tp->get_rhs().Bytes() instead (see the snippet below).
Confidence Score: 1/5
- Critical memory safety bug makes this PR unsafe to merge
- While the tensor.h changes are correct and the optimization approach is sound, there is a critical logic error in base_operator.h:212 that uses the wrong size parameter in cudaMemcpyAsync. When LHS is larger than RHS, the code will copy LHS.Bytes() from RHS memory, causing it to read past the end of RHS's allocated memory. This will lead to undefined behavior, potential crashes, or memory corruption.
- include/matx/operators/base_operator.h requires immediate attention to fix the memory copy size bug on line 212
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| include/matx/core/tensor.h | 5/5 | Fixed preprocessor macro from CUDA_VERSION to CUDART_VERSION for correct CUDA runtime version detection |
| include/matx/operators/base_operator.h | 1/5 | Added cudaMemcpyAsync optimization for contiguous tensor-to-tensor copies, but has critical bug copying wrong number of bytes |
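The macro distinction matters because CUDA_VERSION is defined by the driver API header cuda.h, while CUDART_VERSION is defined by the runtime header cuda_runtime_api.h; code gating runtime-API features should test the latter. A hedged sketch of the pattern (the 12.2 threshold and the prefetch call are illustrative, not necessarily what tensor.h:744 guards):

```cpp
#include <cuda_runtime_api.h>  // defines CUDART_VERSION

void prefetch(const void *ptr, size_t bytes, int device, cudaStream_t stream) {
#if CUDART_VERSION >= 12020   // runtime version, not the driver's CUDA_VERSION
  // CUDA 12.2+ runtime offers the cudaMemLocation overload.
  cudaMemLocation loc{};
  loc.type = cudaMemLocationTypeDevice;
  loc.id   = device;
  cudaMemPrefetchAsync(ptr, bytes, loc, 0 /*flags*/, stream);
#else
  cudaMemPrefetchAsync(ptr, bytes, device, stream);
#endif
}
```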
Sequence Diagram
sequenceDiagram
participant User
participant BaseOperator
participant CUDARuntime
participant TensorLHS
participant TensorRHS
User->>BaseOperator: Execute tensor assignment (LHS = RHS)
BaseOperator->>BaseOperator: Check if both are tensor views
BaseOperator->>BaseOperator: Check if using CUDA executor
alt Both tensors contiguous
BaseOperator->>BaseOperator: Check aliased memory
BaseOperator->>TensorLHS: IsContiguous()?
TensorLHS-->>BaseOperator: true
BaseOperator->>TensorRHS: IsContiguous()?
TensorRHS-->>BaseOperator: true
BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
BaseOperator->>CUDARuntime: cudaMemcpyAsync(LHS.Data(), RHS.Data(), size, stream)
Note over BaseOperator,CUDARuntime: Optimization: Direct memcpy instead of kernel
CUDARuntime-->>BaseOperator: Async copy initiated
else Either tensor non-contiguous
BaseOperator->>CUDARuntime: Launch kernel (ex.Exec)
Note over BaseOperator,CUDARuntime: Fallback: Use element-wise copy kernel
CUDARuntime-->>BaseOperator: Kernel launched
end
BaseOperator-->>User: Operation complete
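The aliasing check in the diagram exists because cudaMemcpyAsync has undefined behavior when the source and destination ranges overlap. A minimal sketch of such a check (hypothetical helper; MatX's actual check may differ):

```cpp
#include <cstdint>

// Two byte ranges overlap unless one ends at or before the other begins.
bool ranges_overlap(const void *a, size_t a_bytes,
                    const void *b, size_t b_bytes) {
  const auto a0 = reinterpret_cast<std::uintptr_t>(a);
  const auto b0 = reinterpret_cast<std::uintptr_t>(b);
  return a0 < b0 + b_bytes && b0 < a0 + a_bytes;
}
```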
2 files reviewed, 2 comments
Greptile Overview
Greptile Summary
This PR optimizes tensor-to-tensor assignment by using cudaMemcpyAsync instead of launching a CUDA kernel when both tensors are contiguous.
Key Changes
- Added fast path for contiguous tensor-to-tensor copies in base_operator.h:200-220
- Includes proper aliasing checks and size validation before the optimization
- Falls back to kernel-based approach for non-contiguous tensors
Issues Addressed
The critical logic issue (using LHS bytes in copy) was already fixed. One minor style issue remains with the log message showing LHS bytes instead of RHS bytes on line 208.
Confidence Score: 4/5
- This PR is safe to merge with the logic fix already applied
- The critical bug (using LHS bytes for copy size) was already fixed in the latest commit. The optimization is well-guarded with aliasing checks and size validation. Only a minor style issue remains with the log message.
- No files require special attention - the logic fix has been applied
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| include/matx/operators/base_operator.h | 4/5 | Adds optimization to use cudaMemcpyAsync for contiguous tensor-to-tensor assignments instead of launching a kernel. The logic correctly uses RHS bytes for the copy size, but log message still shows LHS bytes (style issue). |
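For context, a hedged usage sketch of what would and would not take the fast path (MatX-style API; exact signatures may vary by version):

```cpp
auto a = matx::make_tensor<float>({64, 64});
auto b = matx::make_tensor<float>({64, 64});
matx::cudaExecutor exec{};               // default-stream CUDA executor

(a = b).run(exec);                       // both views contiguous -> cudaMemcpyAsync

auto al = a.Slice({0, 0}, {64, 32});     // left half: rows strided, not contiguous
auto bl = b.Slice({0, 0}, {64, 32});
(al = bl).run(exec);                     // non-contiguous views -> copy kernel
```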
Sequence Diagram
sequenceDiagram
participant User
participant BaseOp
participant SetOp
participant CudaExecutor
participant CUDA
User->>SetOp: tensor_lhs = tensor_rhs
SetOp->>BaseOp: run(executor)
BaseOp->>BaseOp: Check if tensor-to-tensor assignment
alt Both tensors are tensor views & CUDA executor
BaseOp->>BaseOp: check_aliased_memory()
alt Memory is aliased
BaseOp-->>User: THROW matxInvalidParameter
end
alt Both contiguous
BaseOp->>BaseOp: Assert LHS.Bytes() >= RHS.Bytes()
BaseOp->>BaseOp: LOG: Copying with cudaMemcpyAsync
BaseOp->>CUDA: cudaMemcpyAsync(lhs, rhs, RHS.Bytes())
CUDA-->>BaseOp: async copy queued
else Non-contiguous
BaseOp->>BaseOp: LOG: Copying with kernel
BaseOp->>CudaExecutor: Exec(set_op)
CudaExecutor->>CUDA: Launch kernel
CUDA-->>CudaExecutor: kernel queued
end
else Transform or other operators
BaseOp->>BaseOp: check_aliased_memory()
BaseOp->>BaseOp: PreRun()
BaseOp->>CudaExecutor: Exec(set_op)
CudaExecutor->>CUDA: Launch kernel
BaseOp->>BaseOp: PostRun()
end
BaseOp-->>User: operation queued
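One reason the substitution is observably safe: work queued on a single CUDA stream executes in submission order, so a kernel enqueued after the async copy always sees the copied data. A small standalone illustration (names are hypothetical):

```cpp
__global__ void consumer_kernel(const float *p) { /* reads p */ }

void copy_then_consume(float *dst, const float *src, size_t bytes) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream); // step 1
  consumer_kernel<<<1, 256, 0, stream>>>(dst);                 // step 2, ordered after step 1
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}
```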
1 file reviewed, no comments
Force-pushed from 61e3904 to c3e418b
Greptile Overview
Greptile Summary
This PR optimizes tensor-to-tensor assignments by using cudaMemcpyAsync instead of launching a kernel when both tensors are contiguous.
Key changes:
Critical issues found:
Confidence Score: 3/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant BaseOperator
participant Executor
participant CUDA
User->>BaseOperator: Tensor assignment (LHS = RHS)
BaseOperator->>BaseOperator: Check if both are tensor views
BaseOperator->>BaseOperator: Check aliased memory
alt Tensors are contiguous and same rank
BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
BaseOperator->>CUDA: cudaMemcpyAsync(LHS.Data(), RHS.Data(), RHS.Bytes())
CUDA-->>BaseOperator: Async copy initiated
else Tensors not contiguous or different ranks
BaseOperator->>Executor: ex.Exec(*tp) (use kernel)
Executor->>CUDA: Launch custom kernel
CUDA-->>Executor: Kernel execution
end
2 files reviewed, no comments
Force-pushed from c3e418b to 8bcfd5e
2 files reviewed, no comments
No description provided.