Enable ROCm CI support. #1260

Draft: wants to merge 17 commits into base: main

17 commits
4074d0c
Added support to run torchtitan tests on ROCm.
akashveramd Jun 4, 2025
340478a
Added rocm ci support for integration_test_h100.
akashveramd Jun 5, 2025
51427a7
Fixed a bug in build script. Removed ubuntu-cuda folder, instead usin…
akashveramd Jun 7, 2025
2848d51
Included test in integration_tests.py after rebase.
akashveramd Jun 11, 2025
cb13ad4
Modified docker-builds.yml to build rocm docker image for torchtitan.
akashveramd Jun 13, 2025
de9bdcc
Fixed runner for cuda and rocm images in docker-builds.yml.
akashveramd Jun 18, 2025
f634f00
Added TEST_WITH_ROCM environment variable for running tests on rocm. …
akashveramd Jun 19, 2025
87a5a59
Refactored integration_tests.py with skip tests for ROCm.
akashveramd Jun 24, 2025
d748586
Changed runner to i-0962598bd0e8298b3 for building ROCm docker image.
akashveramd Jun 29, 2025
66eba9f
Changed runner to linux.12xlarge for building ROCm docker image.
akashveramd Jun 30, 2025
2d317c3
Changed runner to linux.2xlarge for building ROCm docker image.
akashveramd Jun 30, 2025
18025ad
Added support to use single Dockerfile for both cuda and rocm. Using …
akashveramd Jul 3, 2025
724e202
Changed rocm docker image name in docker-builds.yml.
akashveramd Jul 3, 2025
15a9554
Reverted the changes to integration_test_8gpu_h100.yaml.
akashveramd Jul 9, 2025
cb528bc
Empty dummy commit.
akashveramd Jul 16, 2025
66e5c95
Increased the timeout to 45 minutes to override timeout used in linux…
akashveramd Jul 17, 2025
efd11a8
Empty dummy commit.
akashveramd Jul 17, 2025

10 changes: 9 additions & 1 deletion .ci/docker/build.sh
@@ -13,14 +13,20 @@ shift
 echo "Building ${IMAGE_NAME} Docker image"

 OS=ubuntu
-OS_VERSION=20.04
 CLANG_VERSION=""
 PYTHON_VERSION=3.11
 MINICONDA_VERSION=24.3.0-0

 case "${IMAGE_NAME}" in
   torchtitan-ubuntu-20.04-clang12)
+    OS_VERSION=20.04
     CLANG_VERSION=12
+    BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-runtime-ubuntu${OS_VERSION}
     ;;
+  torchtitan-rocm-ubuntu-22.04-clang12)
+    OS_VERSION=22.04
+    CLANG_VERSION=12
+    BASE_IMAGE=rocm/dev-ubuntu-${OS_VERSION}:latest
+    ;;
   *)
     echo "Invalid image name ${IMAGE_NAME}"
@@ -30,6 +36,7 @@ esac
 docker build \
   --no-cache \
   --progress=plain \
+  --build-arg "BASE_IMAGE=${BASE_IMAGE}" \
   --build-arg "OS_VERSION=${OS_VERSION}" \
   --build-arg "CLANG_VERSION=${CLANG_VERSION}" \
   --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
@@ -38,3 +45,4 @@ docker build \
   -f "${OS}"/Dockerfile \
   "$@" \
   .
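
As a reading aid, the image-name dispatch in the build.sh diff above can be sketched in Python. This is only an illustration of the mapping, not code from the PR: `IMAGE_CONFIGS` and `build_args_for` are hypothetical names, and only the three build args touched by the case statement are modeled.

```python
# Illustrative mirror of the case statement in .ci/docker/build.sh.
# IMAGE_CONFIGS and build_args_for are hypothetical names, not part of the PR.
IMAGE_CONFIGS = {
    "torchtitan-ubuntu-20.04-clang12": {
        "OS_VERSION": "20.04",
        "CLANG_VERSION": "12",
        "BASE_IMAGE": "nvidia/cuda:12.4.1-cudnn-runtime-ubuntu{OS_VERSION}",
    },
    "torchtitan-rocm-ubuntu-22.04-clang12": {
        "OS_VERSION": "22.04",
        "CLANG_VERSION": "12",
        "BASE_IMAGE": "rocm/dev-ubuntu-{OS_VERSION}:latest",
    },
}


def build_args_for(image_name: str) -> list[str]:
    """Return the --build-arg flags that select the base image for image_name."""
    try:
        config = IMAGE_CONFIGS[image_name]
    except KeyError:
        # Same outcome as the `*)` branch of the shell script: reject unknown names.
        raise ValueError(f"Invalid image name {image_name}") from None

    os_version = config["OS_VERSION"]
    values = {
        # The base image tag embeds the OS version, as in the shell script.
        "BASE_IMAGE": config["BASE_IMAGE"].format(OS_VERSION=os_version),
        "OS_VERSION": os_version,
        "CLANG_VERSION": config["CLANG_VERSION"],
    }
    args: list[str] = []
    for key, value in values.items():
        args += ["--build-arg", f"{key}={value}"]
    return args
```

Calling `build_args_for("torchtitan-rocm-ubuntu-22.04-clang12")` yields flags that point the shared Dockerfile at `rocm/dev-ubuntu-22.04:latest`, which is the effect of passing `BASE_IMAGE` through to `FROM ${BASE_IMAGE}`.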

4 changes: 2 additions & 2 deletions .ci/docker/ubuntu/Dockerfile
@@ -1,6 +1,6 @@
-ARG OS_VERSION
+ARG BASE_IMAGE

-FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu${OS_VERSION}
+FROM ${BASE_IMAGE}

 ARG OS_VERSION

7 changes: 5 additions & 2 deletions .github/workflows/docker-builds.yml
@@ -22,13 +22,16 @@ concurrency:

 jobs:
   docker-build:
-    runs-on: [self-hosted, linux.2xlarge]
-    timeout-minutes: 240
     strategy:
       fail-fast: false
       matrix:
         include:
           - docker-image-name: torchtitan-ubuntu-20.04-clang12
+            runner: [self-hosted, linux.2xlarge]
+          - docker-image-name: torchtitan-rocm-ubuntu-22.04-clang12
+            runner: [linux.2xlarge]
+    runs-on: ${{ matrix.runner }}
+    timeout-minutes: 240
     env:
       DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/torchtitan/${{ matrix.docker-image-name }}
     steps:

36 changes: 27 additions & 9 deletions .github/workflows/integration_test_8gpu.yaml
@@ -23,15 +23,33 @@ defaults:
 jobs:
   build-test:
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+    strategy:
+      matrix:
+        include:
+          - name: cuda
+            runner: linux.g5.48xlarge.nvidia.gpu
+            gpu-arch-type: cuda
+            gpu-arch-version: "12.6"
+            # This image is faster to clone than the default, but it lacks CC needed by triton
+            # (1m25s vs 2m37s).
+            docker-image: torchtitan-ubuntu-20.04-clang12
+            index-url: https://download.pytorch.org/whl/nightly/cu126
+            is-rocm: 0
+          - name: rocm
+            runner: linux.rocm.gpu.mi300.8
+            gpu-arch-type: rocm
+            gpu-arch-version: "6.4"
+            docker-image: torchtitan-rocm-ubuntu-22.04-clang12
+            index-url: https://download.pytorch.org/whl/nightly/rocm6.4
+            is-rocm: 1
     with:
-      runner: linux.g5.48xlarge.nvidia.gpu
-      gpu-arch-type: cuda
-      gpu-arch-version: "12.6"
-      # This image is faster to clone than the default, but it lacks CC needed by triton
-      # (1m25s vs 2m37s).
-      docker-image: torchtitan-ubuntu-20.04-clang12
+      runner: ${{ matrix.runner }}
+      gpu-arch-type: ${{ matrix.gpu-arch-type }}
+      gpu-arch-version: ${{ matrix.gpu-arch-version }}
+      docker-image: ${{ matrix.docker-image }}
       repository: pytorch/torchtitan
       upload-artifact: outputs
+      timeout: 45
       script: |
         set -eux

@@ -41,9 +59,9 @@ jobs:

         pip config --user set global.progress_bar off

-        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+        python -m pip install --force-reinstall --pre torch --index-url ${{ matrix.index-url }}

-        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+        USE_CPP=0 python -m pip install --pre torchao --index-url ${{ matrix.index-url }}

         mkdir artifacts-to-be-uploaded
-        python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
+        TEST_WITH_ROCM=${{ matrix.is-rocm }} python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
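
To make the effect of the matrix in the workflow diff above concrete, here is a rough Python sketch of the commands each entry would produce once the `${{ matrix.* }}` placeholders are substituted. GitHub Actions performs the real substitution; `MATRIX` and `render_script` are hypothetical names used only for illustration, and only the script lines visible in the diff are modeled.

```python
# Hypothetical rendering of the per-backend commands from the matrix in
# integration_test_8gpu.yaml. GitHub Actions does the real substitution;
# MATRIX and render_script exist only in this sketch.
MATRIX = [
    {
        "name": "cuda",
        "index-url": "https://download.pytorch.org/whl/nightly/cu126",
        "is-rocm": 0,
    },
    {
        "name": "rocm",
        "index-url": "https://download.pytorch.org/whl/nightly/rocm6.4",
        "is-rocm": 1,
    },
]


def render_script(entry: dict) -> list[str]:
    """Substitute one matrix entry into the script lines shown in the diff."""
    url = entry["index-url"]
    return [
        f"python -m pip install --force-reinstall --pre torch --index-url {url}",
        f"USE_CPP=0 python -m pip install --pre torchao --index-url {url}",
        "mkdir artifacts-to-be-uploaded",
        f"TEST_WITH_ROCM={entry['is-rocm']} python ./tests/integration_tests.py "
        "artifacts-to-be-uploaded --ngpu 8",
    ]


if __name__ == "__main__":
    for entry in MATRIX:
        print(f"# {entry['name']}")
        print("\n".join(render_script(entry)))
```

The point of the sketch is that both backends share one script body and differ only in the wheel index and the `TEST_WITH_ROCM` flag passed to the test runner.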
15 changes: 15 additions & 0 deletions tests/integration_tests.py
@@ -20,6 +20,16 @@
 except ModuleNotFoundError:
     import tomli as tomllib

+# tests skipped for ROCm
+skip_for_rocm_test_list = [
+    "pp_looped_zero_bubble",
+    "pp_zbv",
+    "pp_custom_csv",
+    "last_save_model_weights_only_bf16",
+    "last_save_model_weights_only_fp32",
+]
+TEST_WITH_ROCM = os.getenv("TEST_WITH_ROCM", "0") == "1"
+

 @dataclass
 class OverrideDefinitions:
@@ -568,6 +578,11 @@ def run_tests(args):
                 )
                 if is_integration_test:
                     for test_flavor in integration_tests_flavors[config_file]:
+                        if (
+                            TEST_WITH_ROCM
+                            and test_flavor.test_name in skip_for_rocm_test_list
+                        ):
+                            continue

This logic makes sense to me, but if we really want to use the test setting in integration_tests_h100.py, we should move this logic to that file (and of course rename it to be more agnostic).

Collaborator Author

@jithunnair-amd: All tests in integration_tests_h100.py pass on ROCm, so we don't need TEST_WITH_ROCM in integration_tests_h100.py. However, we need to discuss renaming integration_tests_h100.py, since we also run it on ROCm runners.
cc: @tianyu-l @fegin

                         if args.test == "all" or test_flavor.test_name == args.test:
                             if args.ngpu < test_flavor.ngpu:
                                 logger.info(
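
The skip logic discussed in this thread can be sketched in isolation as a small filter, which may also be one way to share it between test files as the review suggests. This is a minimal sketch under stated assumptions, not the PR's implementation: `SKIP_FOR_ROCM`, `rocm_enabled`, and `select_tests` are hypothetical names, and `example_flavor` is a made-up test name. The skipped names and the `TEST_WITH_ROCM` variable are taken from the diff.

```python
import os

# Test names skipped on ROCm, copied from skip_for_rocm_test_list in the diff.
SKIP_FOR_ROCM = frozenset(
    {
        "pp_looped_zero_bubble",
        "pp_zbv",
        "pp_custom_csv",
        "last_save_model_weights_only_bf16",
        "last_save_model_weights_only_fp32",
    }
)


def rocm_enabled(environ=None) -> bool:
    """True only when TEST_WITH_ROCM is exactly "1", matching the check in the diff."""
    environ = os.environ if environ is None else environ
    return environ.get("TEST_WITH_ROCM", "0") == "1"


def select_tests(test_names, environ=None) -> list:
    """Drop ROCm-skipped tests when TEST_WITH_ROCM=1; otherwise keep everything."""
    if not rocm_enabled(environ):
        return list(test_names)
    return [name for name in test_names if name not in SKIP_FOR_ROCM]


if __name__ == "__main__":
    # "example_flavor" is a hypothetical test name used only for this demo.
    names = ["pp_zbv", "example_flavor"]
    print(select_tests(names, {"TEST_WITH_ROCM": "1"}))
```

Taking the environment as a parameter keeps the filter easy to test without mutating `os.environ`; whether the PR should adopt a shared helper like this is a design question for the thread, not something the diff settles.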