
Conversation

**pavanimajety:** WIP. Will try merging into existing DeepSeek

@@ -0,0 +1,459 @@
# DSR1 Status with vLLM: Aggregated Serving on B200

**Overall Health**: Most paths work. DP Attention is failing in combination with FlashInfer MoE kernels.

> **Contributor:** Could we link this to a GitHub issue?

**How to Invoke:**

FlashInfer:
- Automatic on SM100 (requires flashinfer installed)

What does "Automatic" mean here? it automatically uses FlashInfer gemm on SM100? But this seems to conflict with line 26 which says DeepGemm is the default?

- `VLLM_USE_DEEP_GEMM_E8M0=1` (default)

CUTLASS BlockScale:
- Automatic fallback (requires CUDA 12.8+ for SM100)

What does "Automatic fallback" mean? Fall back from what to what?

- `VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=throughput`

CUTLASS BlockScale:
- Default on SM100 (auto-selected with block quant)

> **Contributor:** If either FlashInfer TRTLLM-Gen or DeepGemm is the most performant, why do we default to CUTLASS? Should we just use DeepGemm by default?
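
For concreteness, a minimal sketch of pairing the FP8 env vars quoted above with a serve command. The model name and parallelism settings below are illustrative assumptions, not taken from the document under review:

```bash
# Sketch only: force the FlashInfer MoE FP8 throughput path (env vars from the doc above).
# deepseek-ai/DeepSeek-R1 and TP=8 are assumptions for illustration.
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size=8 \
  --enable-expert-parallel
```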

**Available Backends:**

TP/EP:
- Flashinfer TRTLLM-Gen

> **Contributor:** This should be Flashinfer TRTLLM-Gen and CUTLASS, right? (according to the "How to Invoke" section)


> **Contributor:** Oh, maybe CUTLASS does not support this yet.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="latency" \
vllm serve nvidia/DeepSeek-R1-FP4 \
  --quantization="modelopt_fp4" \
  --trust-remote-code \
  --max-model-len=2048 \
  --block-size=128 \
  --enable-expert-parallel \
  --gpu-memory-utilization=0.8 \
  --tensor-parallel-size=1 \
  --data-parallel-size=4

> **Contributor:** (on `VLLM_FLASHINFER_MOE_BACKEND="latency"`) If we recommend using latency mode, should we just make it the default?

> **Contributor:** (on `--quantization="modelopt_fp4"`) Is this flag needed?

> **Contributor:** (on `--trust-remote-code`) Is this flag needed?

> **Contributor:** (on `--block-size=128`) What does this flag do?

> **Contributor:** (on the command above) Maybe we should also disable prefix caching for perf benchmarking?
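
Following up on the prefix-caching comment, a minimal sketch of a benchmarking-oriented launch plus a smoke test. The `--no-enable-prefix-caching` flag and the curl request assume the standard vLLM CLI and OpenAI-compatible server API; verify the flag name against the vLLM version in use:

```bash
# Sketch only: same FP4 launch with prefix caching disabled for perf benchmarking.
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="latency" \
vllm serve nvidia/DeepSeek-R1-FP4 \
  --quantization="modelopt_fp4" \
  --trust-remote-code \
  --max-model-len=2048 \
  --block-size=128 \
  --no-enable-prefix-caching \
  --enable-expert-parallel \
  --gpu-memory-utilization=0.8 \
  --tensor-parallel-size=1 \
  --data-parallel-size=4

# Quick smoke test against the OpenAI-compatible endpoint (default port 8000).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/DeepSeek-R1-FP4", "prompt": "Hello", "max_tokens": 16}'
```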
