
Add AMD GPU node for integration test #1241


Draft: wants to merge 45 commits into base branch main

Conversation

@mori360 (Contributor) commented May 29, 2025

Add AMD GPU for integration test.

TODO: fix the AMD GPU runner issue; test whether the integration tests can run on AMD devices.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 29, 2025
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
mkdir artifacts-to-be-uploaded
python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
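The install commands above point at the CUDA nightly index (cu126). For the AMD runner the natural variant swaps in the ROCm nightly index; the sketch below only constructs and prints the commands rather than running them, and the `rocm6.3` index path is an assumption mirroring the `gpu-arch-version: "6.3"` used later in this PR, not something verified here.

```shell
# Sketch: derive the nightly index URL per backend.
# Assumption: a ROCm nightly index exists at .../whl/nightly/rocm6.3,
# parallel to the cu126 index used in the commands above.
BACKEND="rocm6.3"   # "cu126" on the NVIDIA runner
INDEX_URL="https://download.pytorch.org/whl/nightly/${BACKEND}"
echo "python -m pip install --force-reinstall --pre torch --index-url ${INDEX_URL}"
echo "USE_CPP=0 python -m pip install --pre torchao --index-url ${INDEX_URL}"
```

Keeping the backend string in one variable means the NVIDIA and AMD jobs can share the same install steps and differ only in a single parameter.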
Contributor:

The PR title says 4 GPUs, but 8 GPUs are used here.

Contributor Author:

Yeah, it should be on 8 GPUs; I forgot to update the PR title.

@mori360 mori360 changed the title Add integration_test_4gpu_amd.yaml Add integration_test_8gpu_amd.yaml May 30, 2025
@mori360 mori360 changed the title Add integration_test_8gpu_amd.yaml Add AMD GPU node for integration test May 30, 2025
@jithunnair-amd

Hi @mori360, is this PR duplicating work happening on #1260? Our dev is working on that PR to bring up ROCm CI for torchtitan.
cc @akashveramd

@mori360 (Contributor Author) commented Jun 25, 2025

Hi @mori360, is this PR duplicating work happening on #1260? Our dev is working on that PR to bring up ROCm CI for torchtitan. cc @akashveramd

Yeah, it's similar work. We were blocked before and are fixing it now.
@seemethere Do you have any suggestions?

gpu-arch-type: rocm
gpu-arch-version: "6.3"
upload-artifact: outputs
use-custom-docker-registry: false
@jithunnair-amd commented Jun 26, 2025:

@mori360 It looks like you just plan to use the pytorch/manylinux2_28-builder:rocm6.3 image for ROCm and just install the requirements as part of the steps below. Is that the plan for the CUDA yml file as well? Otherwise, we have two inconsistent flows here for ROCm and CUDA.

Contributor Author:

Yeah, I followed the CUDA yml you linked to set the yaml here.
Do you have any suggestions here?


My suggestion here would be to keep the workflow consistent with how we do it for CUDA i.e. build a docker image with the required dependencies installed and then use that docker image to run the tests. This is the approach we are following in #1260.


Please also read my comment #1241 (review) to understand how we view the work in this PR as complementary to the work in #1260.

mkdir artifacts-to-be-uploaded
mkdir generated-artifacts
python ./tests/integration_tests_amd.py generated-artifacts --ngpu 8
cp -r generated-artifacts/* artifacts-to-be-uploaded/


Delete this line and the mkdir artifacts-to-be-uploaded command above to skip the upload; the upload step only runs if the folder exists.
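The reviewer's point — the CI's upload step is gated on the artifacts directory existing — can be sketched as a simple guard. This is a hypothetical illustration of the behavior described, not the actual workflow code:

```shell
# Sketch of the conditional-upload behavior (assumption: the CI upload
# step runs only when this directory exists, as the reviewer describes).
ARTIFACT_DIR="artifacts-to-be-uploaded"
if [ -d "${ARTIFACT_DIR}" ]; then
  echo "uploading contents of ${ARTIFACT_DIR}"
else
  echo "no ${ARTIFACT_DIR} directory; skipping upload"
fi
```

So removing the mkdir call (and the cp into that directory) is enough to take the else branch and skip the upload entirely.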

@mori360 mori360 marked this pull request as ready for review July 8, 2025 23:27
@tianyu-l (Contributor) left a comment:

How valuable is this resource? If we can afford to run more, we could test other features such as FlexAttention, per-op SAC, checkpointing, etc.



@dataclass
class OverrideDefinitions:
Contributor:

Can we reuse this class from the other tests instead of reinventing it?

key is the config file name and value is a list of OverrideDefinitions
that is used to generate variations of integration tests based on the
same root config file.
TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
Contributor:
Suggested change:
- TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
+ TODO: 8*amd gpu current only support 1D TP/DP/CP test, enable tests for PP

Contributor:

Is the 2D test not available at all?

"--parallelism.tensor_parallel_degree 2",
],
],
"TP compile",
Contributor:

It looks like this is FSDP+TP 2D compile?

@mori360 mori360 marked this pull request as draft July 9, 2025 02:47
@jithunnair-amd left a comment:

@mori360 From what I understand, two different integration test files already exist in the torchtitan repo: integration_tests.py and integration_tests_h100.py, and they contain very different sets of tests. However, both seem relevant for GPUs and are run on Nvidia hardware here and here respectively. So I believe both sets of tests are relevant to run on AMD hardware.

In #1260, we are focusing on enabling the first set of tests in the same workflow, and adding skip logic for the few tests that do not run successfully on AMD hardware.

In the current PR, the focus seems to be on the second set of tests, but the approach being taken is to create a separate copy of the tests to run on AMD hardware. This is fine if the intention/expectation is that the AMD tests will look very different from the H100 tests. However, as reported by our developer, integration_tests_h100.py runs successfully on AMD hardware without any modifications, so I'm not sure creating a separate copy of the tests is required.

7 participants