Change the all-reduce strategy to NCCL #6226
base: main
Conversation
When the strategy is set to AUTO and world_size > 1 we experience hangs and CUDA memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change the test test_ad_build_small_multi.py fails (tp==2).
* This is a temporary change until we understand why this hang is happening.
* On dlcluster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>
tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>
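For context, here is a minimal sketch of what explicitly routing an all-reduce through NCCL looks like in plain PyTorch. It is illustrative only: the function name and launch flow below are assumptions, not the actual tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py code, which selects among TensorRT-LLM's own all-reduce strategies.

```python
# Illustrative only: an explicit NCCL all-reduce, i.e. the behavior this PR pins
# instead of the AUTO strategy. Names and launch details are assumptions, not
# the TensorRT-LLM source.
import os

import torch
import torch.distributed as dist


def nccl_all_reduce_demo() -> torch.Tensor:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    # Force the NCCL backend; with AUTO-style runtime selection the PR reports
    # hangs and CUDA memory errors once world_size > 1.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    x = torch.full((4,), float(rank + 1), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # summed across all ranks via NCCL

    dist.destroy_process_group()
    return x


if __name__ == "__main__":
    # Typical launch: torchrun --nproc_per_node=2 nccl_all_reduce_demo.py
    nccl_all_reduce_demo()
```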
Walkthrough
The changes update the all-reduce strategy in the distributed deployment module to use NCCL instead of AUTO as a workaround for a known NVIDIA bug, with a comment noting this is temporary. Additionally, a test that was previously skipped for multi-GPU scenarios is now enabled to run for all configurations.

Estimated code review effort: 2 (~12 minutes)
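As a rough illustration of the test change the walkthrough describes, re-enabling a multi-GPU test typically amounts to dropping a skip marker so the multi-GPU case runs again. The decorator, parameter names, and body below are hypothetical, not the contents of test_ad_build_small_multi.py.

```python
# Hypothetical sketch: the skip guard that previously disabled the multi-GPU
# case is removed so the test runs for all world sizes. Names are illustrative.
import pytest
import torch


# Before (assumed):
# @pytest.mark.skip(reason="hangs with all-reduce strategy AUTO when world_size > 1")
@pytest.mark.parametrize("world_size", [1, 2])
def test_ad_build_small_multi(world_size):
    if torch.cuda.device_count() < world_size:
        pytest.skip(f"requires {world_size} GPUs")
    # Build the small model with tensor parallelism == world_size; the real
    # test asserts that the build and a short generation succeed.
    assert world_size in (1, 2)
```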
/bot run

PR_Github #12459 [ run ] triggered by Bot

PR_Github #12459 [ run ] completed with state
When the strategy is set to AUTO and world_size > 1 we experience hangs and CUDA memory errors.
Re-enable test_ad_build_small_multi.py
Summary by CodeRabbit

Bug Fixes
* Changed the all-reduce strategy from AUTO to NCCL to work around hangs and CUDA memory errors when world_size > 1.

Tests
* Re-enabled test_ad_build_small_multi.py for multi-GPU configurations.