tianyu-l
Contributor

Fixes a bug introduced in #1555.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 18, 2025
@@ -52,6 +52,7 @@ tensor_parallel_degree = 1
enable_async_tensor_parallel = false
pipeline_parallel_degree = 1
pipeline_parallel_schedule = "1F1B"
context_parallel_degree = 1
Contributor

For deepseek-v3, I remember that CP causes a NaN loss issue. Are you trying to support CP here, or does this change need to be reverted?

Contributor Author

I think putting it in debug_model.toml is fine.

If we would like to warn users, we should error out in code. Hiding it from the toml doesn't help much, IMO.

Contributor

Thank you. Yeah, let me error out on CP in deepseek-v3.
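
For reference, the "error out in code" idea from this thread could be sketched roughly as below. This is only an illustration under assumptions: `ParallelismConfig`, `validate_parallelism`, and the `"deepseek_v3"` model name string are hypothetical names invented for the sketch, not the project's actual API. Only the `context_parallel_degree` key comes from the toml shown above.

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    """Hypothetical stand-in for the parallelism section of the toml."""

    context_parallel_degree: int = 1


def validate_parallelism(model_name: str, parallelism: ParallelismConfig) -> None:
    """Raise early if context parallelism (CP) is requested for deepseek-v3.

    Assumption: the model is identified by the string "deepseek_v3".
    """
    if model_name == "deepseek_v3" and parallelism.context_parallel_degree > 1:
        raise NotImplementedError(
            "Context parallelism is not supported for deepseek-v3; "
            f"got context_parallel_degree={parallelism.context_parallel_degree}."
        )


# The default degree of 1 passes the check silently.
validate_parallelism("deepseek_v3", ParallelismConfig(context_parallel_degree=1))
```

Failing at config-validation time makes the limitation visible to users regardless of which toml file sets the value.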

local_rank = device_mesh.get_local_rank()
token_indices_experts_sorted += num_tokens // device_mesh.size() * local_rank
token_indices_experts_sorted += (
Contributor

qq: Numerically, why is this a bug? It seems the calculation is still the same, just with num_tokens made a class parameter.

Contributor Author

The shapes used to compute num_tokens are different.
For input, the shape is [bs * slen, topk].
For output, the shape is [bs * slen * topk,], hence the bug.
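
To make the shape mismatch concrete, here is a small arithmetic sketch. It assumes num_tokens is read from the first dimension of whichever tensor is at hand. The helper `rank_offset` and the example sizes are hypothetical; only the offset expression `num_tokens // device_mesh.size() * local_rank` is taken from the diff.

```python
def rank_offset(num_tokens: int, world_size: int, local_rank: int) -> int:
    """Offset added to the local token indices, mirroring the diff's expression."""
    return num_tokens // world_size * local_rank


# Illustrative sizes only (assumptions, not values from the PR).
bs, slen, topk = 2, 4, 2
world_size, local_rank = 2, 1

input_shape = (bs * slen, topk)      # shape on the input side
output_shape = (bs * slen * topk,)   # flattened shape on the output side

# Reading num_tokens from dim 0 gives different values on each side.
offset_from_input = rank_offset(input_shape[0], world_size, local_rank)
offset_from_output = rank_offset(output_shape[0], world_size, local_rank)

print(offset_from_input, offset_from_output)
```

With these sizes the two offsets differ by a factor of topk, which is consistent with storing num_tokens from the input side and reusing it on the output side.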

@tianyu-l tianyu-l merged commit 9233d83 into main Aug 18, 2025
7 checks passed
@tianyu-l tianyu-l deleted the dsv3 branch August 18, 2025 00:51
2 participants