
Add AMD GPU node for integration test #1241


Draft: wants to merge 45 commits into base branch main

Conversation

@mori360 (Contributor) commented May 29, 2025

Add AMD GPU for integration test.

TODO: fix the AMD GPU runner issue; test whether the integration tests can run on AMD devices.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 29, 2025
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
mkdir artifacts-to-be-uploaded
python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
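The install commands above point at the CUDA nightly index (cu126). For the AMD runner the natural variant swaps in the ROCm nightly index; the sketch below only constructs and prints the commands rather than running them, and the `rocm6.3` index path is an assumption mirroring the `gpu-arch-version: "6.3"` used later in this PR, not something verified here.

```shell
# Sketch: derive the nightly index URL per backend.
# Assumption: a ROCm nightly index exists at .../whl/nightly/rocm6.3,
# parallel to the cu126 index used in the commands above.
BACKEND="rocm6.3"   # "cu126" on the NVIDIA runner
INDEX_URL="https://download.pytorch.org/whl/nightly/${BACKEND}"
echo "python -m pip install --force-reinstall --pre torch --index-url ${INDEX_URL}"
echo "USE_CPP=0 python -m pip install --pre torchao --index-url ${INDEX_URL}"
```

Keeping the backend string in one variable means the NVIDIA and AMD jobs can share the same install steps and differ only in a single parameter.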
Contributor:

The PR title says 4 GPUs, but 8 GPUs are used here.

Contributor Author:

Yeah, it should be on 8 GPUs; I forgot to update the PR title.

@mori360 mori360 changed the title Add integration_test_4gpu_amd.yaml Add integration_test_8gpu_amd.yaml May 30, 2025
@mori360 mori360 changed the title Add integration_test_8gpu_amd.yaml Add AMD GPU node for integration test May 30, 2025
@jithunnair-amd

Hi @mori360, is this PR duplicating work happening on #1260? Our dev is working on that PR to bring up ROCm CI for torchtitan.
cc @akashveramd

@mori360 (Contributor Author) commented Jun 25, 2025

Hi @mori360, is this PR duplicating work happening on #1260? Our dev is working on that PR to bring up ROCm CI for torchtitan. cc @akashveramd

Yeah, it's similar work. We were blocked before and are fixing it now.
@seemethere Do you have any suggestions?

gpu-arch-type: rocm
gpu-arch-version: "6.3"
upload-artifact: outputs
use-custom-docker-registry: false
@jithunnair-amd commented Jun 26, 2025:

@mori360 It looks like you just plan to use the pytorch/manylinux2_28-builder:rocm6.3 image for ROCm and just install the requirements as part of the steps below. Is that the plan for the CUDA yml file as well? Otherwise, we have two inconsistent flows here for ROCm and CUDA.

Contributor Author:

Yeah, I followed the CUDA yml you linked to set the yaml here.
Do you have any suggestions here?


My suggestion here would be to keep the workflow consistent with how we do it for CUDA i.e. build a docker image with the required dependencies installed and then use that docker image to run the tests. This is the approach we are following in #1260.


Please also read my comment #1241 (review) to understand how we view the work in this PR as complementary to the work in #1260.

mkdir artifacts-to-be-uploaded
mkdir generated-artifacts
python ./tests/integration_tests_amd.py generated-artifacts --ngpu 8
cp -r generated-artifacts/* artifacts-to-be-uploaded/


Delete this line and the mkdir artifacts-to-be-uploaded command above to skip the upload; the upload step only runs if the folder exists.
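The reviewer's point — the CI's upload step is gated on the artifacts directory existing — can be sketched as a simple guard. This is a hypothetical illustration of the behavior described, not the actual workflow code:

```shell
# Sketch of the conditional-upload behavior (assumption: the CI upload
# step runs only when this directory exists, as the reviewer describes).
ARTIFACT_DIR="artifacts-to-be-uploaded"
if [ -d "${ARTIFACT_DIR}" ]; then
  echo "uploading contents of ${ARTIFACT_DIR}"
else
  echo "no ${ARTIFACT_DIR} directory; skipping upload"
fi
```

So removing the mkdir call (and the cp into that directory) is enough to take the else branch and skip the upload entirely.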

@mori360 mori360 marked this pull request as ready for review July 8, 2025 23:27
@tianyu-l (Contributor) left a comment:

How valuable is this resource? If we can afford to run more, we could test other features such as FlexAttention, per-op SAC, checkpointing, etc.



@dataclass
class OverrideDefinitions:
Contributor:

Can we reuse this class from the other tests instead of reinventing it?

key is the config file name and value is a list of OverrideDefinitions
that is used to generate variations of integration tests based on the
same root config file.
TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
Contributor:
Suggested change:
- TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
+ TODO: 8*amd gpu current only support 1D TP/DP/CP test, enable tests for PP

Contributor:

Is the 2D test not available at all?

"--parallelism.tensor_parallel_degree 2",
],
],
"TP compile",
Contributor:

It looks like this is FSDP+TP 2D compile?

@mori360 mori360 marked this pull request as draft July 9, 2025 02:47
@jithunnair-amd left a comment:

@mori360 From what I understand, two different integration test files already exist in the torchtitan repo: integration_tests.py and integration_tests_h100.py, and they contain very different sets of tests. However, both seem relevant for GPUs and are run on Nvidia hardware here and here respectively. So I believe both sets of tests are relevant to run on AMD hardware.

In #1260, we are focusing on enabling the first set of tests in the same workflow, and adding skip logic for the few tests that do not run successfully on AMD hardware.

In the current PR, the focus seems to be on the second set of tests, but the approach being taken is to create a separate copy of the tests to run on AMD hardware. This is fine if the intention/expectation is that the AMD tests will look very different from the H100 tests. However, as reported by our developer, integration_tests_h100.py runs successfully on AMD hardware without any modifications, so I'm not sure creating a separate copy of the tests is required.

7 participants