Add AMD GPU node for integration test #1241
base: main
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
mkdir artifacts-to-be-uploaded
python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
The PR title says 4 GPUs, but 8 GPUs are used here.
Yeah, it should be on 8 GPUs; I forgot to update the PR title.
Co-authored-by: Jithun Nair <[email protected]>
test change
Hi @mori360, is this PR duplicating work happening on #1260? Our dev is working on that PR to bring up ROCm CI for torchtitan.
Yeah, similar work here. We were blocked before and are now fixing it.
gpu-arch-type: rocm
gpu-arch-version: "6.3"
upload-artifact: outputs
use-custom-docker-registry: false
Yeah, I followed the CUDA yml you linked to set up the YAML here.
Do you have any suggestions?
My suggestion here would be to keep the workflow consistent with how we do it for CUDA, i.e. build a Docker image with the required dependencies installed and then use that image to run the tests. This is the approach we are following in #1260.
Please also read my comment #1241 (review) to understand how we view the work in this PR as being complementary to the work in #1260.
mkdir artifacts-to-be-uploaded
mkdir generated-artifacts
python ./tests/integration_tests_amd.py generated-artifacts --ngpu 8
cp -r generated-artifacts/* artifacts-to-be-uploaded/
Delete this line and the `mkdir artifacts-to-be-uploaded` command above to skip the upload. The upload step only happens if the folder exists.
How valuable is the resource? If we can afford to run more, we can test other features such as FlexAttention, per-op SAC, checkpointing, etc.
@dataclass
class OverrideDefinitions:
Can we reuse this class from the other tests instead of reinventing it?
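A minimal sketch of what that reuse could look like, assuming the dataclass lives in one shared place and the AMD test list only instantiates it. The field names (`override_args`, `test_descr`, `test_name`, `ngpu`) and the `build_amd_tests` helper are assumptions for illustration, not the repo's confirmed schema; only the flag string and the "TP compile" description come from the diff under review.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


# Hypothetical shared definition. In a real layout this would be defined once
# (for example in the existing integration test module) and imported by the
# AMD test file rather than redeclared there.
@dataclass
class OverrideDefinitions:
    override_args: Sequence[Sequence[str]] = field(default_factory=tuple)
    test_descr: str = "default"
    test_name: str = "default"
    ngpu: int = 4  # assumed default; the AMD job in this PR passes 8


def build_amd_tests() -> List[OverrideDefinitions]:
    """Return AMD-specific test variations built from the shared class."""
    return [
        OverrideDefinitions(
            [["--parallelism.tensor_parallel_degree 2"]],
            "TP compile",
            "tp_compile",  # hypothetical test name
            ngpu=8,
        ),
    ]


AMD_TESTS = build_amd_tests()
```

With this shape, the AMD file only contributes data (the list of overrides), so changes to the class itself stay in a single place.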
tests/integration_tests_amd.py
Outdated
key is the config file name and value is a list of OverrideDefinitions
that is used to generate variations of integration tests based on the
same root config file.
TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
- TODO: 8*amd gpu current only support 1D TP/DP/CP test, ebale tests for PP
+ TODO: 8*amd gpu current only support 1D TP/DP/CP test, enable tests for PP
Are 2D tests not available at all?
tests/integration_tests_amd.py
Outdated
"--parallelism.tensor_parallel_degree 2",
],
],
"TP compile",
It looks like this is FSDP+TP 2D compile?
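The arithmetic behind that question can be sketched as follows, assuming the test runs on all 8 GPUs and that the data-parallel (FSDP) degree is whatever is left once the other degrees are fixed. The helper names are hypothetical and this is not torchtitan's own mesh-construction code.

```python
def implied_dp_degree(ngpu: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel degree left over once TP, PP and CP are fixed."""
    other = tp * pp * cp
    if ngpu % other != 0:
        raise ValueError(f"ngpu={ngpu} is not divisible by tp*pp*cp={other}")
    return ngpu // other


def num_parallel_dims(ngpu: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Count how many parallelism dimensions end up with a degree above 1."""
    degrees = [implied_dp_degree(ngpu, tp, pp, cp), tp, pp, cp]
    return sum(d > 1 for d in degrees)


# 8 GPUs with tensor_parallel_degree 2 leaves a data-parallel degree of 4,
# so two dimensions are active: the run is 2D rather than TP-only.
print(implied_dp_degree(8, tp=2), num_parallel_dims(8, tp=2))  # 4 2
```

Under these assumptions, a test labelled "TP compile" would only be 1D if it ran on exactly 2 GPUs.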
@mori360 From what I understand, there are already two different integration test files in the torchtitan repo: integration_tests.py and integration_tests_h100.py, and these two contain very different sets of tests. However, both of these seem to be relevant for GPUs and are run on Nvidia hardware here and here respectively. So I believe both sets of tests are relevant to run on AMD hardware.
In #1260, we are focusing on enabling the first set of tests in the same workflow, and adding skip logic for the few tests that do not run successfully on AMD hardware.
In the current PR, the focus seems to be on the second set of tests, but the approach being taken is to create a separate copy of the tests to be run on AMD hardware. This is fine if the intention/expectation is that the AMD tests will look very different from the H100 tests. However, as reported by our developer, integration_tests_h100.py runs successfully on AMD hardware without any modifications, so I'm not sure a separate copy of the tests is required.
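For illustration, the "single test list plus skip logic" approach could be sketched like this. Everything here is hypothetical: the `IntegrationTest` class, the `skip_on_rocm` flag, and the test names are placeholders, not the actual implementation in #1260.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IntegrationTest:
    name: str
    skip_on_rocm: bool = False  # hypothetical flag marking tests not yet passing on AMD


def select_tests(tests: List[IntegrationTest], is_rocm: bool) -> List[IntegrationTest]:
    """Keep every test on CUDA; drop only the flagged ones on ROCm."""
    return [t for t in tests if not (is_rocm and t.skip_on_rocm)]


# One shared list serves both backends; the names are placeholders.
ALL_TESTS = [
    IntegrationTest("tp_compile"),
    IntegrationTest("pp_example", skip_on_rocm=True),
]

# In a real run, `is_rocm` could be derived from the installed PyTorch build,
# e.g. `torch.version.hip is not None`; it is passed in explicitly here so the
# sketch runs without a GPU or a torch installation.
rocm_tests = select_tests(ALL_TESTS, is_rocm=True)
cuda_tests = select_tests(ALL_TESTS, is_rocm=False)
```

The trade-off is that a skip flag keeps the two backends on one test list, whereas a separate AMD file makes divergence easy but duplicates the definitions.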
Add an AMD GPU node for the integration tests.
TODO: fix the AMD GPU runner issue, then test its capabilities to decide which tests to run on AMD.