
Conversation

@emailweixu (Contributor)
In the previous implementation, fused_linear_act.StaticState would always allocate a CUDA tensor as soon as the module was imported. The simple act of allocating even a small tensor causes torch to reserve several hundred MB of CUDA memory, which can become very bad when there are many subprocesses.

The fix is simple: only create the tensor when it is needed.
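The pattern behind the fix can be sketched as lazy initialization of a class-level cache: the tensor is created on first access rather than at import time, so merely importing the module never touches CUDA. The names below (`StaticState`, `get_zero_bias`) and the stand-in allocation are hypothetical, for illustration only; the real code would call something like `torch.zeros(..., device="cuda")` where noted.

```python
# Hypothetical sketch of the lazy-allocation pattern described in this PR.
# StaticState and get_zero_bias are illustrative names, not the actual API.

class StaticState:
    """Holds a shared tensor, created lazily instead of at import time."""

    _zero_bias = None   # an eager version would allocate here, at class creation
    allocations = 0     # instrumentation for this sketch only

    @classmethod
    def get_zero_bias(cls):
        # Allocate on first use; importing the module no longer touches CUDA.
        if cls._zero_bias is None:
            cls.allocations += 1
            cls._zero_bias = [0.0] * 4  # stand-in for torch.zeros(4, device="cuda")
        return cls._zero_bias


# Nothing is allocated until the first call, and repeat calls reuse the cache:
assert StaticState.allocations == 0
b1 = StaticState.get_zero_bias()
b2 = StaticState.get_zero_bias()
assert StaticState.allocations == 1 and b1 is b2
```

Because each forked subprocess that imports the module would otherwise trigger its own CUDA context (and its several-hundred-MB overhead), deferring the allocation means only processes that actually use the tensor pay that cost.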

@emailweixu emailweixu merged commit f2c844e into pytorch Nov 12, 2025
2 checks passed
@emailweixu emailweixu deleted the PR_fix_fused_linear_act_memory branch November 12, 2025 16:57
