Skip to content

Conversation

@KimbingNg
Copy link
Contributor

@KimbingNg KimbingNg commented Nov 26, 2025

What does this PR do?

Fixes # (12673) and Tencent-Hunyuan/HunyuanVideo-1.5#10 I identified the root cause: record_stream isn’t functioning as expected. This causes the offload_to_memory function to update param.data without waiting for the default stream (current_stream) to complete (the forward pass).

Updates [Nov 27]: The root cause!!
The current_stream() call inside _transfer_tensor_to_device is under with torch.cuda.stream(self.stream). So it is not returning the correct default stream to call record_stream on !

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @asomoza @sayakpaul @a-r-r-o-w @DN6

@KimbingNg
Copy link
Contributor Author

Also relevant to Tencent-Hunyuan/HunyuanVideo-1.5#10

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR!

Could we also run the test_group_offloading tests?

pytest tests/models -k "test_group_offloading"?

@KimbingNg
Copy link
Contributor Author

pytest tests/models -k "test_group_offloading"

Running pytest:

=============================================== 365 passed, 37 skipped, 3604 deselected, 757 warnings in 100.34s (0:01:40) ================================================

@KimbingNg KimbingNg changed the title Fixes #12673. record_stream is not working properly Fixes #12673. record_stream in group offloading is not working properly Nov 26, 2025
@sayakpaul sayakpaul requested a review from DN6 November 26, 2025 12:30
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@DN6 DN6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @KimbingNg I think we might also need to make a similar change to _offload_to_disk.

    Wrong default_stream is used. leading to wrong execution order when record_steram is enabled.
@KimbingNg
Copy link
Contributor Author

KimbingNg commented Nov 27, 2025

@DN6 , I found the root cause!!!!

The current_stream() call inside _transfer_tensor_to_device is under with torch.cuda.stream(self.stream). So it is not returning the correct default stream to call record_stream on !

As for the onload_from_disk case, the current_stream is obtained outside the with context, so it works fine.

I force push a new commit, please help reviewing that again, thanks!

@DN6
Copy link
Collaborator

DN6 commented Nov 27, 2025

@KimbingNg Ah great catch! 👍🏽

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also remove the following lines:

if self.__class__.__name__ == "AutoencoderKLCosmosTests" and offload_type == "leaf_level":
            pytest.skip("With `leaf_type` as the offloading type, it fails. Needs investigation.")

in the test_group_offloading_with_disk() test implementation.

@KimbingNg
Copy link
Contributor Author

KimbingNg commented Nov 27, 2025

We can also remove the following lines:

if self.__class__.__name__ == "AutoencoderKLCosmosTests" and offload_type == "leaf_level":
            pytest.skip("With `leaf_type` as the offloading type, it fails. Needs investigation.")

in the test_group_offloading_with_disk() test implementation.

I ran pytest tests/models -k "test_group_offloading" after removing these two lines:

=============================================== 366 passed, 36 skipped, 3604 deselected, 1177 warnings in 123.01s (0:02:03) ===============================================

Should I push a commit that removes these two lines?

@sayakpaul
Copy link
Member

Yes

@KimbingNg
Copy link
Contributor Author

The two lines are removed

@sayakpaul sayakpaul requested a review from DN6 November 27, 2025 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

QwenImagepipeline imcompatable with group offloading with record_stream=True

4 participants