Add DeepSeek-R1 B200 Recipes #85
base: main
Conversation
Signed-off-by: Pavani Majety <[email protected]>
| @@ -0,0 +1,459 @@
| # DSR1 Status with vLLM: Aggregated Serving on B200
|
| **Overall Health**: Most paths work. DP Attention is failing in combination with FlashInfer MoE kernels.
could we link this to a GitHub issue?
| **How to Invoke:**
|
| FlashInfer:
| - Automatic on SM100 (requires flashinfer installed)
What does "Automatic" mean here? It automatically uses the FlashInfer GEMM on SM100? But this seems to conflict with line 26, which says DeepGemm is the default?
| - `VLLM_USE_DEEP_GEMM_E8M0=1` (default)
|
| CUTLASS BlockScale:
| - Automatic fallback (requires CUDA 12.8+ for SM100)
What does "Automatic fallback" mean? Fall back from what to what?
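For concreteness, a minimal sketch of pinning the DeepGemm path explicitly rather than relying on auto-selection might help answer this. `VLLM_USE_DEEP_GEMM_E8M0` is taken from the quoted recipe, `VLLM_USE_DEEP_GEMM` is assumed as the top-level DeepGEMM switch, and the model name and TP size are illustrative:

```bash
# Sketch only: force the DeepGemm FP8 GEMM path instead of auto-selection.
# VLLM_USE_DEEP_GEMM is an assumption; VLLM_USE_DEEP_GEMM_E8M0 comes from the recipe text.
VLLM_USE_DEEP_GEMM=1 \
VLLM_USE_DEEP_GEMM_E8M0=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size=8 \
  --trust-remote-code
```

Spelling out which env var forces which backend, and what "automatic fallback" falls back from, in this style would resolve the ambiguity.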
| - `VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=throughput`
|
| CUTLASS BlockScale:
| - Default on SM100 (auto-selected with block quant)
If either FlashInfer TRTLLM-Gen or DeepGemm is the most performant, why do we default to CUTLASS? Should we just use DeepGemm by default?
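Related: a short sketch of what opting in to the FlashInfer TRTLLM-Gen MoE path (as opposed to the CUTLASS BlockScale default) might look like, using only the env vars already quoted above; the model name and TP size are illustrative:

```bash
# Sketch only: opt in to the FlashInfer TRTLLM-Gen MoE kernels; CUTLASS BlockScale
# remains the default when these env vars are unset. Model and TP size are illustrative.
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size=8
```

If DeepGemm or TRTLLM-Gen is measurably faster, documenting the recommended override here (rather than changing the default) would at least make the intent clear.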
| **Available Backends:**
|
| TP/EP:
| - Flashinfer TRTLLM-Gen
This should be Flashinfer TRTLLM-Gen and CUTLASS, right? (according to the "How to Invoke" section)
oh, maybe CUTLASS does not support this yet.
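For the TP/EP entry itself, the distinction is just the launch flags; a sketch with illustrative parallel sizes for a single B200 node, leaving the MoE backend to auto-selection:

```bash
# Sketch only: tensor parallelism only.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size=8

# Sketch only: tensor parallelism plus expert parallelism for the MoE layers.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size=8 --enable-expert-parallel
```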
| CUDA_VISIBLE_DEVICES=0,1,2,3 \
| VLLM_USE_STANDALONE_COMPILE=0 \
| VLLM_USE_FLASHINFER_MOE_FP4=1 \
| VLLM_FLASHINFER_MOE_BACKEND="latency" \
if we recommend using latency mode, should we just make it the default?
| VLLM_USE_FLASHINFER_MOE_FP4=1 \
| VLLM_FLASHINFER_MOE_BACKEND="latency" \
| vllm serve nvidia/DeepSeek-R1-FP4 \
| --quantization="modelopt_fp4" \
is this flag needed?
| VLLM_FLASHINFER_MOE_BACKEND="latency" \
| vllm serve nvidia/DeepSeek-R1-FP4 \
| --quantization="modelopt_fp4" \
| --trust-remote-code \
is this flag needed?
| --quantization="modelopt_fp4" \
| --trust-remote-code \
| --max-model-len=2048 \
| --block-size=128 \
what does this flag do?
| --enable-expert-parallel \
| --gpu-memory-utilization=0.8 \
| --tensor-parallel-size=1 \
| --data-parallel-size=4
Maybe we should also disable prefix caching for perf benchmarking?
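If so, the quoted launch would only need one more flag; a sketch, assuming the standard `--no-enable-prefix-caching` boolean toggle as the off switch:

```bash
# Sketch only: same FP4 DP/EP launch as quoted above, with prefix caching
# disabled for performance benchmarking (per the review suggestion).
CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="latency" \
vllm serve nvidia/DeepSeek-R1-FP4 \
  --quantization="modelopt_fp4" \
  --trust-remote-code \
  --max-model-len=2048 \
  --block-size=128 \
  --enable-expert-parallel \
  --gpu-memory-utilization=0.8 \
  --tensor-parallel-size=1 \
  --data-parallel-size=4 \
  --no-enable-prefix-caching
```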
WIP. Will try merging into the existing DeepSeek recipe.