
Conversation

Vishal-sys-code

This PR introduces Muon optimizer grafting: a grafting variant for the Shampoo optimizer that uses Muon-style momentum-SGD as the grafting method. This approach is useful for training large-scale models efficiently and follows the optimization direction mentioned in the open issue #203.

While not formally assigned to this task, I came across the discussion in issue #203, where Muon support was mentioned as a potential future addition. The comment:

"Thanks for your kind words, and I am not familiar with K-FAC and TNT, but this repo is currently focusing on Shampoo-like algorithms, i.e., Distributed Shampoo, SOAP, Muon (coming soon) for now."

...inspired me to implement Muon grafting, contributing to the community and the project. I hope this can serve as a helpful starting point or even be merged directly, if appropriate.

What is included:

  • MuonGraftingConfig: A new configuration class to explicitly enable and control Muon grafting behavior (see the usage sketch below).
  • DistributedShampoo enhancements: Updated to support and integrate the new Muon grafting mechanism cleanly.
  • Test coverage: Added test_muon_grafting to ensure the implementation behaves as expected and is compatible with the existing optimizer framework.

The design aligns with the architecture and practices observed in the repo and other optimizer implementations. Care has been taken to ensure maintainability, modularity, and minimal disruption to existing components.
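For context, here is a minimal end-to-end usage sketch. The DistributedShampoo arguments follow the patterns in the repo's README; the MuonGraftingConfig import path and its momentum argument are illustrative assumptions and may change based on review feedback:

```python
import torch

from distributed_shampoo import DistributedShampoo

# Added by this PR; the import path and argument names shown here are
# illustrative and may change based on review feedback.
from distributed_shampoo import MuonGraftingConfig

model = torch.nn.Linear(512, 256)
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    precondition_frequency=100,
    # Graft the Muon (orthogonalized momentum-SGD) step magnitude onto
    # Shampoo's preconditioned search direction.
    grafting_config=MuonGraftingConfig(momentum=0.95),
)

# One standard training step.
optimizer.zero_grad()
loss = model(torch.randn(8, 512)).sum()
loss.backward()
optimizer.step()
```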

Note: I understand the Muon optimizer was on the roadmap, and I hope this early implementation is helpful. Open to feedback and happy to revise or improve the implementation based on your guidance or plans for Muon support.

Thank you for your time and for maintaining this excellent repository.

@tsunghsienlee
Contributor

Hi @Vishal-sys-code,

Thanks for your PR! I wonder: are you interested in Muon itself, or in using Muon for grafting?

For Muon itself, it was added a few weeks ago, and https://github.com/facebookresearch/optimizers/tree/main/distributed_shampoo#example-6-muon shows how to use it.
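From memory, the shape is roughly the sketch below, but please treat the linked README example as authoritative, since I may be misremembering the exact names:

```python
import torch

from distributed_shampoo import DistributedShampoo

# Config name recalled from the README's Muon example (Example 6);
# double-check the linked instructions for the exact, current API.
from distributed_shampoo import SpectralDescentPreconditionerConfig

model = torch.nn.Linear(512, 256)
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-2,
    # Muon is momentum-SGD whose update is orthogonalized (Newton-Schulz);
    # in this repo it is expressed as a spectral-descent preconditioner.
    preconditioner_config=SpectralDescentPreconditionerConfig(),
)
```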

Using Muon for grafting is not supported yet; I am working on that by merging GraftingConfig into PreconditionerConfig so the Muon implementation above can be used for grafting as well.
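Once that refactor lands, I would expect the user-facing shape to look roughly like the sketch below (all names here are hypothetical until the merge is done):

```python
# Hypothetical post-refactor sketch: grafting accepts the same family of
# preconditioner configs as the main update, so Muon-for-grafting falls
# out of the existing Muon implementation for free.
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    # Shampoo remains the main preconditioner...
    preconditioner_config=DefaultShampooConfig,
    # ...while the Muon preconditioner is reused to set the grafted step size.
    grafting_config=SpectralDescentPreconditionerConfig(),
)
```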

Please let me know what your needs are so we can figure out how to do this together.

@Vishal-sys-code
Author

Hi @tsunghsienlee, thanks for the pointer and the review!

My goal with this PR was not to add a separate Muon optimizer, but to let Shampoo use Muon for grafting. I noticed Muon is already in the upstream repo, awesome.

If you’re planning to merge GraftingConfig into PreconditionerConfig so the existing Muon implementation can be reused for grafting, I’m happy to adapt my changes to that design instead of duplicating code. Specifically, I can:

  • Rebase and update this PR to use the upstream Muon preconditioner (or a small adapter) for grafting.
  • Convert MuonGraftingConfig into a thin adapter/alias that maps to the unified config.
  • Expand test_muon_grafting to check for consistent behavior between “Muon as preconditioner” and “Muon used for grafting.”
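Here is a rough sketch of the consistency test I have in mind; the config names are placeholders that would track whatever the unified design ends up exposing:

```python
import torch

from distributed_shampoo import DistributedShampoo

# Placeholder names; swap in whatever the unified config design exposes.
# from distributed_shampoo import MuonGraftingConfig, SpectralDescentPreconditionerConfig


def _train_one_step(make_optimizer) -> torch.Tensor:
    torch.manual_seed(0)  # identical init and data for both variants
    model = torch.nn.Linear(16, 16)
    optimizer = make_optimizer(model.parameters())
    model(torch.randn(4, 16)).sum().backward()
    optimizer.step()
    return model.weight.detach().clone()


def test_muon_grafting() -> None:
    # "Muon as preconditioner" vs. "Muon used for grafting": both variants
    # should train without error from identical starting points; stronger
    # checks on the internal Muon statistics can be layered on afterwards.
    weight_precond = _train_one_step(
        lambda params: DistributedShampoo(
            params,
            lr=1e-3,
            preconditioner_config=SpectralDescentPreconditionerConfig(),
        )
    )
    weight_grafted = _train_one_step(
        lambda params: DistributedShampoo(
            params,
            lr=1e-3,
            grafting_config=MuonGraftingConfig(),
        )
    )
    assert weight_precond.shape == weight_grafted.shape
```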

Which approach would you prefer?

  • I can rework the PR now to match the PreconditionerConfig merge, or
  • I can keep this PR as a short-term, standalone Muon-grafting implementation and switch it over after your refactor lands.

I’m flexible; tell me which you’d like and I’ll update the branch. Thanks again!

@tsunghsienlee
Contributor

Hi @Vishal-sys-code,

Sorry for my late reply; I was working on merging GraftingConfig into PreconditionerConfig so that the same set of optimizers can be used for both grafting and preconditioning. The problem is that there are some internal dependencies stemming from the types of PreconditionerConfig and GraftingConfig, so it won't be easy to work on from the OSS side.

For now, would you mind reviewing my PR when I finish it? I could definitely credit you as a co-author of that PR, since you had the same idea as I did.

@Vishal-sys-code
Author

Vishal-sys-code commented Aug 17, 2025

Hi @tsunghsienlee, no worries at all, and thanks for clarifying!

That sounds like a solid plan. I’d be glad to review your PR once it’s ready, and appreciate your kind offer to credit me as a co-author. Honestly, I’m just happy if my work helped spark or support the direction you’re already moving in.

Looking forward to your changes; please tag me when the PR is up, and I’ll give it a careful review. Thanks again for taking the time to explain the design decisions!

@tsunghsienlee
Contributor

Closing this one because #242 covered it.
