
Conversation

@cliffburdick (Collaborator)
No description provided.

copy-pr-bot bot commented Nov 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cliffburdick (Collaborator, Author)

/build

@greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Optimizes tensor-to-tensor assignment operations by using cudaMemcpyAsync instead of launching a kernel when both tensors are contiguous, and fixes incorrect preprocessor macro usage in tensor prefetch methods.

Key changes:

  • Fixed CUDA_VERSION to CUDART_VERSION in tensor.h:744 and tensor.h:768 for correct CUDA runtime version detection
  • Added optimization path in base_operator.h that detects contiguous tensor-to-tensor assignments and uses cudaMemcpyAsync instead of kernel execution
  • Includes aliased memory checking before the optimized copy
  • Falls back to kernel execution for non-contiguous tensors
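The decision logic described above can be sketched on the host as follows. `View` and `try_fast_copy` are hypothetical stand-ins for the MatX tensor view and the new branch in base_operator.h, and plain `std::memcpy` stands in for `cudaMemcpyAsync`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the MatX tensor view; the real type lives in
// include/matx/core/tensor.h.
struct View {
  void *data;
  size_t bytes;
  bool contiguous;
  void *Data() const { return data; }
  size_t Bytes() const { return bytes; }
  bool IsContiguous() const { return contiguous; }
};

// Returns true if the fast copy path was taken, false if the caller should
// fall back to the element-wise kernel.
bool try_fast_copy(View &lhs, const View &rhs) {
  if (!lhs.IsContiguous() || !rhs.IsContiguous()) {
    return false;                      // non-contiguous: needs a kernel
  }
  assert(lhs.Bytes() >= rhs.Bytes());  // destination must be large enough
  // Real code: cudaMemcpyAsync(lhs.Data(), rhs.Data(), rhs.Bytes(),
  //                            cudaMemcpyDeviceToDevice, stream);
  std::memcpy(lhs.Data(), rhs.Data(), rhs.Bytes());
  return true;
}
```

Note the copy size is bounded by the source (RHS) byte count, which is the point the review below turns on.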

Issues found:

  • Critical bug in base_operator.h:212: uses tp->get_lhs().Bytes() as the copy size, which will read past the end of RHS memory when LHS is larger than RHS. Should use tp->get_rhs().Bytes() instead.

Confidence Score: 1/5

  • Critical memory safety bug makes this PR unsafe to merge
  • While the tensor.h changes are correct and the optimization approach is sound, there is a critical logic error in base_operator.h:212 that uses the wrong size parameter in cudaMemcpyAsync. When LHS is larger than RHS, the code will copy LHS.Bytes() from RHS memory, causing it to read past the end of RHS's allocated memory. This will lead to undefined behavior, potential crashes, or memory corruption.
  • include/matx/operators/base_operator.h requires immediate attention to fix the memory copy size bug on line 212
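To make the flagged size bug concrete, here is a minimal host-side sketch; `checked_copy` is a hypothetical illustration (not MatX API), with `std::memcpy` standing in for `cudaMemcpyAsync`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The destination (LHS) may be allocated larger than the source (RHS).
// Copying lhs_bytes would read past the end of src, so the copy size must
// come from the source's byte count.
void checked_copy(void *dst, size_t lhs_bytes,
                  const void *src, size_t rhs_bytes) {
  assert(lhs_bytes >= rhs_bytes);    // destination must fit the source
  // Buggy version flagged above: std::memcpy(dst, src, lhs_bytes);
  std::memcpy(dst, src, rhs_bytes);  // correct: bounded by the source size
}
```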

Important Files Changed

File Analysis

  • include/matx/core/tensor.h (5/5): Fixed preprocessor macro from CUDA_VERSION to CUDART_VERSION for correct CUDA runtime version detection
  • include/matx/operators/base_operator.h (1/5): Added cudaMemcpyAsync optimization for contiguous tensor-to-tensor copies, but has critical bug copying wrong number of bytes

Sequence Diagram

sequenceDiagram
    participant User
    participant BaseOperator
    participant CUDARuntime
    participant TensorLHS
    participant TensorRHS

    User->>BaseOperator: Execute tensor assignment (LHS = RHS)
    BaseOperator->>BaseOperator: Check if both are tensor views
    BaseOperator->>BaseOperator: Check if using CUDA executor
    
    alt Both tensors contiguous
        BaseOperator->>BaseOperator: Check aliased memory
        BaseOperator->>TensorLHS: IsContiguous()?
        TensorLHS-->>BaseOperator: true
        BaseOperator->>TensorRHS: IsContiguous()?
        TensorRHS-->>BaseOperator: true
        BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
        BaseOperator->>CUDARuntime: cudaMemcpyAsync(LHS.Data(), RHS.Data(), size, stream)
        Note over BaseOperator,CUDARuntime: Optimization: Direct memcpy instead of kernel
        CUDARuntime-->>BaseOperator: Async copy initiated
    else Either tensor non-contiguous
        BaseOperator->>CUDARuntime: Launch kernel (ex.Exec)
        Note over BaseOperator,CUDARuntime: Fallback: Use element-wise copy kernel
        CUDARuntime-->>BaseOperator: Kernel launched
    end
    
    BaseOperator-->>User: Operation complete

2 files reviewed, 2 comments


@cliffburdick (Collaborator, Author)

/build

@greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR optimizes tensor-to-tensor assignment by using cudaMemcpyAsync instead of launching a CUDA kernel when both tensors are contiguous.

Key Changes

  • Added fast path for contiguous tensor-to-tensor copies in base_operator.h:200-220
  • Includes proper aliasing checks and size validation before the optimization
  • Falls back to kernel-based approach for non-contiguous tensors

Issues Addressed

The critical logic issue (using LHS bytes in copy) was already fixed. One minor style issue remains with the log message showing LHS bytes instead of RHS bytes on line 208.

Confidence Score: 4/5

  • This PR is safe to merge with the logic fix already applied
  • The critical bug (using LHS bytes for copy size) was already fixed in the latest commit. The optimization is well-guarded with aliasing checks and size validation. Only a minor style issue remains with the log message.
  • No files require special attention - the logic fix has been applied

Important Files Changed

File Analysis

  • include/matx/operators/base_operator.h (4/5): Adds optimization to use cudaMemcpyAsync for contiguous tensor-to-tensor assignments instead of launching a kernel. The logic correctly uses RHS bytes for the copy size, but the log message still shows LHS bytes (style issue).

Sequence Diagram

sequenceDiagram
    participant User
    participant BaseOp
    participant SetOp
    participant CudaExecutor
    participant CUDA

    User->>SetOp: tensor_lhs = tensor_rhs
    SetOp->>BaseOp: run(executor)
    BaseOp->>BaseOp: Check if tensor-to-tensor assignment
    
    alt Both tensors are tensor views & CUDA executor
        BaseOp->>BaseOp: check_aliased_memory()
        alt Memory is aliased
            BaseOp-->>User: THROW matxInvalidParameter
        end
        
        alt Both contiguous
            BaseOp->>BaseOp: Assert LHS.Bytes() >= RHS.Bytes()
            BaseOp->>BaseOp: LOG: Copying with cudaMemcpyAsync
            BaseOp->>CUDA: cudaMemcpyAsync(lhs, rhs, RHS.Bytes())
            CUDA-->>BaseOp: async copy queued
        else Non-contiguous
            BaseOp->>BaseOp: LOG: Copying with kernel
            BaseOp->>CudaExecutor: Exec(set_op)
            CudaExecutor->>CUDA: Launch kernel
            CUDA-->>CudaExecutor: kernel queued
        end
    else Transform or other operators
        BaseOp->>BaseOp: check_aliased_memory()
        BaseOp->>BaseOp: PreRun()
        BaseOp->>CudaExecutor: Exec(set_op)
        CudaExecutor->>CUDA: Launch kernel
        BaseOp->>BaseOp: PostRun()
    end
    
    BaseOp-->>User: operation queued

1 file reviewed, no comments


@cliffburdick (Collaborator, Author)

/build

9 similar comments

greptile-apps bot commented Nov 12, 2025

Greptile Overview

Greptile Summary

This PR optimizes tensor-to-tensor assignments by using cudaMemcpyAsync instead of launching a custom kernel when both tensors are contiguous and have the same rank. It also fixes an incorrect macro usage in tensor.h (CUDA_VERSION → CUDART_VERSION).

Key changes:

  • Added fast path for contiguous tensor assignments using cudaMemcpyAsync in base_operator.h:200-220
  • Fixed macro from CUDA_VERSION to CUDART_VERSION in tensor.h:744,768 for consistency with codebase
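The macro distinction behind the fix: CUDA_VERSION is defined by the driver API header (cuda.h), while CUDART_VERSION is defined by the runtime header (cuda_runtime_api.h), so gating runtime features such as prefetch on the driver macro can misfire. A sketch of the guard pattern (the 12000 threshold is illustrative only, not the value tensor.h checks; on a host-only build the macro is undefined and the fallback branch compiles):

```cpp
// When compiling against the CUDA runtime, cuda_runtime_api.h defines
// CUDART_VERSION; a plain host build leaves it undefined.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
constexpr bool kRuntimePrefetchPath = true;   // runtime new enough: prefetch
#else
constexpr bool kRuntimePrefetchPath = false;  // too old or absent: fall back
#endif
```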

Issues found:

  • Earlier reviews flagged a memory-safety bug on line 212 (wrong byte count in the copy); the current code uses tp->get_rhs().Bytes(), so that bug appears fixed
  • Log message on line 209 still shows the LHS byte count instead of the actual copy size

Confidence Score: 3/5

  • The optimization approach and the CUDART_VERSION fix are sound, but the remaining logging inconsistency should be fixed before merging
  • On re-examination, the cudaMemcpyAsync call on line 212 correctly uses tp->get_rhs().Bytes(), so the memory-safety bug flagged in earlier reviews is no longer present; the only remaining issue is that the log message on line 209 reports tp->get_lhs().Bytes() instead of the size actually copied
  • Pay close attention to include/matx/operators/base_operator.h lines 207-214 for the copy-size logic and logging accuracy

Important Files Changed

File Analysis

  • include/matx/operators/base_operator.h (3/5): Added cudaMemcpyAsync optimization for contiguous tensor-to-tensor assignments; the copy size is now correct, but a logging inconsistency remains
  • include/matx/core/tensor.h (5/5): Fixed CUDA_VERSION to CUDART_VERSION macro for consistency with rest of codebase

Sequence Diagram

sequenceDiagram
    participant User
    participant BaseOperator
    participant Executor
    participant CUDA

    User->>BaseOperator: Tensor assignment (LHS = RHS)
    BaseOperator->>BaseOperator: Check if both are tensor views
    BaseOperator->>BaseOperator: Check aliased memory
    alt Tensors are contiguous and same rank
        BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
        BaseOperator->>CUDA: cudaMemcpyAsync(LHS.Data(), RHS.Data(), RHS.Bytes())
        CUDA-->>BaseOperator: Async copy initiated
    else Tensors not contiguous or different ranks
        BaseOperator->>Executor: ex.Exec(*tp) (use kernel)
        Executor->>CUDA: Launch custom kernel
        CUDA-->>Executor: Kernel execution
    end

@greptile-apps bot left a comment

2 files reviewed, no comments



@cliffburdick cliffburdick merged commit f4117f3 into main Nov 12, 2025
@cliffburdick cliffburdick deleted the copy_cudamemcpy branch November 12, 2025 20:25
