49 changes: 49 additions & 0 deletions Samples/3_CUDA_Features/globalToShmemTMACopy/README.md
@@ -0,0 +1,49 @@
# globalToShmemTMACopy - Global Memory to Shared Memory TMA Copy

## Description

This sample shows how to use the CUDA driver API and inline PTX assembly to copy
a 2D tile of a tensor from global memory into shared memory. It also demonstrates
the use of an arrive-wait barrier for synchronization.
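
As a rough illustration of the bookkeeping behind such a copy, the host-side sketch below computes the number of bytes an arrive-wait barrier would expect from one tile transfer and checks the size constraints that, as far as the driver API documentation for tiled tensor maps describes them, a 2D tile has to meet. `TileConfig` and the helper functions are hypothetical names introduced only for this example. They are not taken from the sample, and the dimensions in the usage note are arbitrary.

```cpp
#include <cstddef>

// Hypothetical description of a 2D tile copy; not part of the sample.
struct TileConfig {
    std::size_t global_width;  // elements per row of the full tensor in global memory
    std::size_t tile_width;    // elements per row of the tile placed in shared memory
    std::size_t tile_height;   // number of rows in the tile
    std::size_t element_size;  // bytes per element
};

// Bytes that one tile copy delivers to shared memory. This is the
// transaction count the barrier would be told to expect.
constexpr std::size_t tile_bytes(const TileConfig& c) {
    return c.tile_width * c.tile_height * c.element_size;
}

// Byte distance between consecutive rows of the tensor in global memory.
constexpr std::size_t row_stride_bytes(const TileConfig& c) {
    return c.global_width * c.element_size;
}

// Assumed constraints for a non-interleaved tiled tensor map:
//  - the global row stride is a multiple of 16 bytes,
//  - each tile (box) dimension is non-zero and at most 256 elements,
//  - the inner tile extent in bytes is a multiple of 16.
constexpr bool tile_is_valid(const TileConfig& c) {
    return row_stride_bytes(c) % 16 == 0
        && c.tile_width > 0 && c.tile_width <= 256
        && c.tile_height > 0 && c.tile_height <= 256
        && (c.tile_width * c.element_size) % 16 == 0;
}
```

For example, an 8 x 8 tile of 4-byte elements taken from a tensor that is 256 elements wide gives `tile_bytes` of 256 and a row stride of 1024 bytes, and passes the checks. A tile that is only 3 elements wide fails them because its 12-byte inner extent is not a multiple of 16.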

## Key Concepts

CUDA Runtime API, CUDA Driver API, PTX ISA, CPP11 CUDA

## Supported SM Architectures

This sample requires compute capability 9.0 or higher.

[SM 9.0](https://developer.nvidia.com/cuda-gpus)

## Supported OSes

Linux, Windows, QNX

## Supported CPU Architecture

x86_64, ppc64le, armv7l, aarch64

## CUDA APIs involved

### [CUDA Runtime API](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize

### [CUDA Driver API](http://docs.nvidia.com/cuda/cuda-driver-api/index.html)
cuTensorMapEncodeTiled

### [CUDA PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html)

## Dependencies needed to build/run
[CPP11](../../../README.md#cpp11)

## Prerequisites

Download and install the [CUDA Toolkit 12.2](https://developer.nvidia.com/cuda-downloads) for your corresponding platform.
Make sure the dependencies mentioned in the [Dependencies](#dependencies-needed-to-buildrun) section above are installed.

## Build and Run


## References (for more details)
