Use cudaMemcpyAsync rather than kernel when possible #1088
Conversation
Greptile Overview
Greptile Summary
Optimizes tensor-to-tensor assignment operations by using cudaMemcpyAsync instead of launching a kernel when both tensors are contiguous, and fixes incorrect preprocessor macro usage in tensor prefetch methods.
Key changes:
- Fixed CUDA_VERSION to CUDART_VERSION in tensor.h:744 and tensor.h:768 for correct CUDA runtime version detection
- Added an optimization path in base_operator.h that detects contiguous tensor-to-tensor assignments and uses cudaMemcpyAsync instead of kernel execution (a minimal sketch follows this list)
- Includes aliased memory checking before the optimized copy
- Falls back to kernel execution for non-contiguous tensors
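To make the fast path concrete, here is a minimal sketch of the idea, not MatX's actual code: assign_fast_path and launch_copy_kernel are hypothetical names, while IsContiguous(), Data(), and Bytes() mirror the tensor-view accessors the review refers to.

```cpp
#include <cuda_runtime_api.h>

// Minimal sketch (hypothetical helper, not the actual base_operator.h code).
template <typename Lhs, typename Rhs>
void assign_fast_path(Lhs &lhs, const Rhs &rhs, cudaStream_t stream) {
  if (lhs.IsContiguous() && rhs.IsContiguous()) {
    // Dense views: one flat byte copy is equivalent to the element-wise
    // assignment kernel but skips the kernel launch entirely.
    // cudaMemcpyDefault lets the runtime infer the direction via UVA.
    cudaMemcpyAsync(lhs.Data(), rhs.Data(), rhs.Bytes(),
                    cudaMemcpyDefault, stream);
  } else {
    // Strided or sliced views need per-element index math, so fall back
    // to the regular copy kernel.
    launch_copy_kernel(lhs, rhs, stream); // hypothetical fallback
  }
}
```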
Issues found:
- Critical bug in base_operator.h:212: uses tp->get_lhs().Bytes() as the copy size, which will read past the end of RHS memory when LHS is larger than RHS. Should use tp->get_rhs().Bytes() instead (see the snippet below).
Confidence Score: 1/5
- Critical memory safety bug makes this PR unsafe to merge
- While the tensor.h changes are correct and the optimization approach is sound, there is a critical logic error in base_operator.h:212 that uses the wrong size parameter in cudaMemcpyAsync. When LHS is larger than RHS, the code will copy LHS.Bytes() from RHS memory, causing it to read past the end of RHS's allocated memory. This will lead to undefined behavior, potential crashes, or memory corruption.
- include/matx/operators/base_operator.h requires immediate attention to fix the memory copy size bug on line 212
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| include/matx/core/tensor.h | 5/5 | Fixed preprocessor macro from CUDA_VERSION to CUDART_VERSION for correct CUDA runtime version detection |
| include/matx/operators/base_operator.h | 1/5 | Added cudaMemcpyAsync optimization for contiguous tensor-to-tensor copies, but has critical bug copying wrong number of bytes |
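The macro distinction matters because CUDA_VERSION is defined by the driver API header cuda.h, while CUDART_VERSION is defined by the runtime header cuda_runtime_api.h; code gating runtime-API features should test the latter. A hedged sketch of the pattern (the 12.2 threshold and the prefetch call are illustrative, not necessarily what tensor.h:744 guards):

```cpp
#include <cuda_runtime_api.h>  // defines CUDART_VERSION

void prefetch(const void *ptr, size_t bytes, int device, cudaStream_t stream) {
#if CUDART_VERSION >= 12020   // runtime version, not the driver's CUDA_VERSION
  // CUDA 12.2+ runtime offers the cudaMemLocation overload.
  cudaMemLocation loc{};
  loc.type = cudaMemLocationTypeDevice;
  loc.id   = device;
  cudaMemPrefetchAsync(ptr, bytes, loc, 0 /*flags*/, stream);
#else
  cudaMemPrefetchAsync(ptr, bytes, device, stream);
#endif
}
```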
Sequence Diagram
sequenceDiagram
participant User
participant BaseOperator
participant CUDARuntime
participant TensorLHS
participant TensorRHS
User->>BaseOperator: Execute tensor assignment (LHS = RHS)
BaseOperator->>BaseOperator: Check if both are tensor views
BaseOperator->>BaseOperator: Check if using CUDA executor
alt Both tensors contiguous
BaseOperator->>BaseOperator: Check aliased memory
BaseOperator->>TensorLHS: IsContiguous()?
TensorLHS-->>BaseOperator: true
BaseOperator->>TensorRHS: IsContiguous()?
TensorRHS-->>BaseOperator: true
BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
BaseOperator->>CUDARuntime: cudaMemcpyAsync(LHS.Data(), RHS.Data(), size, stream)
Note over BaseOperator,CUDARuntime: Optimization: Direct memcpy instead of kernel
CUDARuntime-->>BaseOperator: Async copy initiated
else Either tensor non-contiguous
BaseOperator->>CUDARuntime: Launch kernel (ex.Exec)
Note over BaseOperator,CUDARuntime: Fallback: Use element-wise copy kernel
CUDARuntime-->>BaseOperator: Kernel launched
end
BaseOperator-->>User: Operation complete
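The aliasing check in the diagram exists because cudaMemcpyAsync has undefined behavior when the source and destination ranges overlap. A minimal sketch of such a check (hypothetical helper; MatX's actual check may differ):

```cpp
#include <cstdint>

// Two byte ranges overlap unless one ends at or before the other begins.
bool ranges_overlap(const void *a, size_t a_bytes,
                    const void *b, size_t b_bytes) {
  const auto a0 = reinterpret_cast<std::uintptr_t>(a);
  const auto b0 = reinterpret_cast<std::uintptr_t>(b);
  return a0 < b0 + b_bytes && b0 < a0 + a_bytes;
}
```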
2 files reviewed, 2 comments
Greptile Overview
Greptile Summary
This PR optimizes tensor-to-tensor assignment by using cudaMemcpyAsync instead of launching a CUDA kernel when both tensors are contiguous.
Key Changes
- Added fast path for contiguous tensor-to-tensor copies in base_operator.h:200-220
- Includes proper aliasing checks and size validation before the optimization
- Falls back to kernel-based approach for non-contiguous tensors
Issues Addressed
The critical logic issue (using LHS bytes in copy) was already fixed. One minor style issue remains with the log message showing LHS bytes instead of RHS bytes on line 208.
Confidence Score: 4/5
- This PR is safe to merge with the logic fix already applied
- The critical bug (using LHS bytes for copy size) was already fixed in the latest commit. The optimization is well-guarded with aliasing checks and size validation. Only a minor style issue remains with the log message.
- No files require special attention - the logic fix has been applied
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| include/matx/operators/base_operator.h | 4/5 | Adds optimization to use cudaMemcpyAsync for contiguous tensor-to-tensor assignments instead of launching a kernel. The logic correctly uses RHS bytes for the copy size, but log message still shows LHS bytes (style issue). |
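For context, a hedged usage sketch of what would and would not take the fast path (MatX-style API; exact signatures may vary by version):

```cpp
auto a = matx::make_tensor<float>({64, 64});
auto b = matx::make_tensor<float>({64, 64});
matx::cudaExecutor exec{};               // default-stream CUDA executor

(a = b).run(exec);                       // both views contiguous -> cudaMemcpyAsync

auto al = a.Slice({0, 0}, {64, 32});     // left half: rows strided, not contiguous
auto bl = b.Slice({0, 0}, {64, 32});
(al = bl).run(exec);                     // non-contiguous views -> copy kernel
```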
Sequence Diagram
sequenceDiagram
participant User
participant BaseOp
participant SetOp
participant CudaExecutor
participant CUDA
User->>SetOp: tensor_lhs = tensor_rhs
SetOp->>BaseOp: run(executor)
BaseOp->>BaseOp: Check if tensor-to-tensor assignment
alt Both tensors are tensor views & CUDA executor
BaseOp->>BaseOp: check_aliased_memory()
alt Memory is aliased
BaseOp-->>User: THROW matxInvalidParameter
end
alt Both contiguous
BaseOp->>BaseOp: Assert LHS.Bytes() >= RHS.Bytes()
BaseOp->>BaseOp: LOG: Copying with cudaMemcpyAsync
BaseOp->>CUDA: cudaMemcpyAsync(lhs, rhs, RHS.Bytes())
CUDA-->>BaseOp: async copy queued
else Non-contiguous
BaseOp->>BaseOp: LOG: Copying with kernel
BaseOp->>CudaExecutor: Exec(set_op)
CudaExecutor->>CUDA: Launch kernel
CUDA-->>CudaExecutor: kernel queued
end
else Transform or other operators
BaseOp->>BaseOp: check_aliased_memory()
BaseOp->>BaseOp: PreRun()
BaseOp->>CudaExecutor: Exec(set_op)
CudaExecutor->>CUDA: Launch kernel
BaseOp->>BaseOp: PostRun()
end
BaseOp-->>User: operation queued
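One reason the substitution is observably safe: work queued on a single CUDA stream executes in submission order, so a kernel enqueued after the async copy always sees the copied data. A small standalone illustration (names are hypothetical):

```cpp
__global__ void consumer_kernel(const float *p) { /* reads p */ }

void copy_then_consume(float *dst, const float *src, size_t bytes) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream); // step 1
  consumer_kernel<<<1, 256, 0, stream>>>(dst);                 // step 2, ordered after step 1
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}
```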
1 file reviewed, no comments
Force-pushed from 61e3904 to c3e418b
Greptile Overview
Greptile Summary
This PR optimizes tensor-to-tensor assignments by using cudaMemcpyAsync instead of launching a kernel when both tensors are contiguous.
Key changes:
Critical issues found:
Confidence Score: 3/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant BaseOperator
participant Executor
participant CUDA
User->>BaseOperator: Tensor assignment (LHS = RHS)
BaseOperator->>BaseOperator: Check if both are tensor views
BaseOperator->>BaseOperator: Check aliased memory
alt Tensors are contiguous and same rank
BaseOperator->>BaseOperator: Validate LHS.Bytes() >= RHS.Bytes()
BaseOperator->>CUDA: cudaMemcpyAsync(LHS.Data(), RHS.Data(), RHS.Bytes())
CUDA-->>BaseOperator: Async copy initiated
else Tensors not contiguous or different ranks
BaseOperator->>Executor: ex.Exec(*tp) (use kernel)
Executor->>CUDA: Launch custom kernel
CUDA-->>Executor: Kernel execution
end
2 files reviewed, no comments
Force-pushed from c3e418b to 8bcfd5e
2 files reviewed, no comments
No description provided.