feat: Implement Muon Optimizer Grafting #227
Closed
This PR introduces Muon Optimizer Grafting, a variant of the Shampoo optimizer that incorporates Momentum-SGD for grafting. This approach is beneficial for training large-scale models efficiently, and follows the optimization direction mentioned in the open issue #203.
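For readers new to grafting, the general layer-wise idea can be sketched as follows: keep the direction of one update and borrow the step size (norm) of another, here a momentum-SGD step. This is a minimal pure-Python illustration under that assumption, not the code in this PR; the helper names `frobenius_norm`, `momentum_sgd_step`, and `graft` are hypothetical and do not come from the repository.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def momentum_sgd_step(grad, momentum_buf, beta=0.9):
    """Return the new momentum buffer: beta * buf + grad (heavy-ball form)."""
    return [
        [beta * m + g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum_buf, grad)
    ]


def graft(direction, magnitude_source, eps=1e-12):
    """Rescale `direction` so its norm matches that of `magnitude_source`."""
    scale = frobenius_norm(magnitude_source) / (frobenius_norm(direction) + eps)
    return [[scale * x for x in row] for row in direction]


# Toy example: a 2x2 gradient and a zero-initialised momentum buffer.
grad = [[3.0, 0.0], [0.0, 4.0]]
momentum_buf = [[0.0, 0.0], [0.0, 0.0]]
graft_update = momentum_sgd_step(grad, momentum_buf)

# Stand-in for a preconditioned (e.g. Shampoo-style) search direction.
precond_direction = [[1.0, 0.0], [0.0, 1.0]]

# Final update: direction from the preconditioner, norm from momentum SGD.
update = graft(precond_direction, graft_update)
```

In this sketch the preconditioned direction decides where to move and the grafted momentum-SGD step decides how far, so the two components can be tuned separately.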
Although I was not formally assigned to this task, I came across the discussion in issue #203, which mentioned Muon support as a potential future addition. The comment:
...inspired me to implement Muon grafting as a contribution to the community and the project. I hope this can serve as a helpful starting point or even be merged directly, if appropriate.
What is included:
- `MuonGraftingConfig`: A new configuration class to explicitly enable and control Muon grafting behavior.
- `DistributedShampoo` enhancements: Updated to support and integrate the new Muon grafting mechanism cleanly.
- `test_muon_grafting`: Tests to ensure the implementation behaves as expected and is compatible with the existing optimizer framework.

The design aligns with the architecture and practices observed in the repo and in other optimizer implementations. Care has been taken to ensure maintainability, modularity, and minimal disruption to existing components.
Note: I understand the Muon optimizer was on the roadmap, and I hope this early implementation is helpful. I am open to feedback and happy to revise the implementation based on your guidance or plans for Muon support.
Thank you for your time and for maintaining this excellent repository.