From 90986347d0bc337ccfe285fc27fe8310ecd365c2 Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Wed, 6 Aug 2025 16:57:58 +0800
Subject: [PATCH 1/4] Create README_AMD_GPU.md

---
 DeepSeek/AMD_GPU/README.md | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 DeepSeek/AMD_GPU/README.md

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
new file mode 100644
index 0000000..bfec33b
--- /dev/null
+++ b/DeepSeek/AMD_GPU/README.md
@@ -0,0 +1,45 @@
+## AMD GPU Installation and Benchmarking Guide
+#### Support Matrix
+
+##### GPU TYPE
+MI300X
+##### DATA TYPE
+FP8
+
+#### Step-by-Step Guide
+Please follow the steps below to install and run DeepSeek-R1 models on AMD MI300X GPUs.
+#### Step 1
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -it --rm \
+ --cap-add=SYS_PTRACE \
+ -e SHELL=/bin/bash \
+ --network=host \
+ --security-opt seccomp=unconfined \
+ --device=/dev/kfd \
+ --device=/dev/dri \
+ -v /:/workspace \
+ --group-add video \
+ --ipc=host \
+ --name vllm_DS \
+rocm/vllm-dev:nightly
+```
+#### Step 2
+ Hugging Face login
+```shell
+ huggingface-cli login
+```
+#### Step 3
+##### FP8
+
+Run the vLLM online serving
+Sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8 --max-model-len 32768 --block-size 1 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768 --gpu-memory-utilization 0.95 --trust-remote-code
+```
+#### Step 4
+Open a new terminal, enter the running Docker container, and run the following benchmark script.
+```shell
+docker exec -it vllm_DS /bin/bash
+python3 /vllm-workspace/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+```

From 2e89cae7361ac8ad1aa5ccefde49b816dcd060dd Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 10:04:47 +0800
Subject: [PATCH 2/4] Update README.md

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 76e036e..77235cd 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,9 @@ This repo intends to host community maintained common recipes to run vLLM answer
 ### Qwen
 Qwen
 - [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)
+### AMD GPU Support
+For the user guide, see the AMD_GPU directory within the model directory (e.g., DeepSeek/AMD_GPU).
+
 ## Contributing
 Please feel free to contribute by adding a new recipe or improving an existing one, just send us a PR!
@@ -31,4 +34,4 @@ uv run mkdocs serve
 ```
 
 ## License
-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
\ No newline at end of file
+This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
From cbb075bc97ee792159824303c4eaa36d34375b49 Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 14:31:57 +0800
Subject: [PATCH 3/4] Update README.md

---
 DeepSeek/AMD_GPU/README.md | 45 +++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
index bfec33b..74fc917 100644
--- a/DeepSeek/AMD_GPU/README.md
+++ b/DeepSeek/AMD_GPU/README.md
@@ -22,11 +22,12 @@ docker run -it --rm \
  --group-add video \
  --ipc=host \
  --name vllm_DS \
-rocm/vllm-dev:nightly
+rocm/vllm:latest
 ```
 #### Step 2
  Hugging Face login
 ```shell
+ pip install -U "huggingface_hub[cli]"
  huggingface-cli login
 ```
 #### Step 3
@@ -35,11 +36,49 @@ rocm/vllm-dev:nightly
 ##### FP8
 
 Run the vLLM online serving
 Sample command:
 ```shell
-SAFETENSORS_FAST_GPU=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8 --max-model-len 32768 --block-size 1 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768 --gpu-memory-utilization 0.95 --trust-remote-code
+SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
+vllm serve deepseek-ai/DeepSeek-R1 \
+--tensor-parallel-size 8 \
+--max-model-len 32768 \
+--max-num-seqs 1024 \
+--max-num-batched-tokens 32768 \
+--disable-log-requests \
+--block-size 1 \
+--compilation-config '{"full_cuda_graph":false}' \
+--trust-remote-code
 ```
 #### Step 4
 Open a new terminal, enter the running Docker container, and run the following benchmark script.
 ```shell
 docker exec -it vllm_DS /bin/bash
-python3 /vllm-workspace/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+```
+```shell
+Maximum request concurrency: 128
+100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [04:43<00:00, 1.76it/s]
+============ Serving Benchmark Result ============
+Successful requests:                     500
+Benchmark duration (s):                  283.98
+Total input tokens:                      1597574
+Total generated tokens:                  400000
+Request throughput (req/s):              1.76
+Output token throughput (tok/s):         1408.53
+Total Token throughput (tok/s):          7034.09
+---------------Time to First Token----------------
+Mean TTFT (ms):                          7585.82
+Median TTFT (ms):                        4689.25
+P99 TTFT (ms):                           30544.70
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          80.02
+Median TPOT (ms):                        83.26
+P99 TPOT (ms):                           88.89
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           80.02
+Median ITL (ms):                         50.92
+P99 ITL (ms):                            2263.85
+----------------End-to-end Latency----------------
+Mean E2EL (ms):                          71521.56
+Median E2EL (ms):                        71237.75
+P99 E2EL (ms):                           97463.28
+==================================================
 ```

From b7d325e027bfa796f217af9dc4c020361db1bc4f Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 17:11:37 +0800
Subject: [PATCH 4/4] Update README.md

---
 DeepSeek/AMD_GPU/README.md | 84 +++++++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 24 deletions(-)

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
index 74fc917..41fc2cc 100644
--- a/DeepSeek/AMD_GPU/README.md
+++ b/DeepSeek/AMD_GPU/README.md
@@ -8,7 +8,32 @@ FP8
 
 #### Step-by-Step Guide
 Please follow the steps below to install and run DeepSeek-R1 models on AMD MI300X GPUs.
+The model requires 8 x MI300X GPUs.
+
 #### Step 1
+Verify the GPU environment by running rocm-smi:
+```shell
+================================================== ROCm System Management Interface ==================================================
+============================================================ Concise Info ============================================================
+Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf              PwrCap  VRAM%  GPU%
+              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
+=======================================================================================================================================
+0       9     0x74b5,   21947  51.0°C      163.0W    NPS1, SPX, 0        144Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+1       8     0x74b5,   37820  45.0°C      154.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+2       7     0x74b5,   39350  46.0°C      163.0W    NPS1, SPX, 0        142Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+3       6     0x74b5,   24497  53.0°C      172.0W    NPS1, SPX, 0        142Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+4       5     0x74b5,   36258  51.0°C      169.0W    NPS1, SPX, 0        145Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+5       4     0x74b5,   19365  44.0°C      158.0W    NPS1, SPX, 0        148Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+6       3     0x74b5,   16815  53.0°C      167.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+7       2     0x74b5,   34728  46.0°C      165.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+=======================================================================================================================================
+```
+Lock the GPU frequency for deterministic performance:
+```shell
+rocm-smi --setperfdeterminism 1900
+```
+
+#### Step 2
 Launch the ROCm vLLM Docker container:
 ```shell
 docker run -it --rm \
@@ -24,8 +49,7 @@ docker run -it --rm \
  --ipc=host \
  --name vllm_DS \
 rocm/vllm:latest
 ```
-#### Step 2
- Hugging Face login
+Hugging Face login
 ```shell
  pip install -U "huggingface_hub[cli]"
  huggingface-cli login
 ```
 #### Step 3
@@ -36,10 +60,10 @@
 ##### FP8
 
 Run the vLLM online serving
 Sample command:
 ```shell
-SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
+NCCL_MIN_NCHANNELS=112 SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
 vllm serve deepseek-ai/DeepSeek-R1 \
 --tensor-parallel-size 8 \
---max-model-len 32768 \
+--max-model-len 65536 \
 --max-num-seqs 1024 \
 --max-num-batched-tokens 32768 \
 --disable-log-requests \
 --block-size 1 \
 --compilation-config '{"full_cuda_graph":false}' \
 --trust-remote-code
 ```
+
+##### Tips: Users may adjust the following parameters as needed.
+--max-model-len=65536: A good sweet spot in most cases; it conserves memory while still allowing long contexts.
+
+--max-num-batched-tokens=32768: Balances throughput against memory use and latency.
+
+If OOM errors or sluggish performance occur, decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k). For low-latency needs, consider reducing max-num-batched-tokens. To maximize throughput when spare VRAM is available, keep it high, but stay aware of the latency trade-off.
+
+--max-num-seqs=1024: Affects the throughput/latency trade-off. Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency; lower values reduce the GPU memory footprint and latency at the cost of throughput.
+
+
 #### Step 4
-Open a new terminal, enter the running Docker container, and run the following benchmark script.
+Open a new terminal, access the running Docker container, and execute the online serving benchmark script as follows:
+
 ```shell
 docker exec -it vllm_DS /bin/bash
-python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 256 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
 ```
 ```shell
-Maximum request concurrency: 128
-100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [04:43<00:00, 1.76it/s]
+Maximum request concurrency: 256
+100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [03:54<00:00, 2.14it/s]
 ============ Serving Benchmark Result ============
 Successful requests:                     500
-Benchmark duration (s):                  283.98
+Benchmark duration (s):                  234.00
 Total input tokens:                      1597574
 Total generated tokens:                  400000
-Request throughput (req/s):              1.76
-Output token throughput (tok/s):         1408.53
-Total Token throughput (tok/s):          7034.09
+Request throughput (req/s):              2.14
+Output token throughput (tok/s):         1709.39
+Total Token throughput (tok/s):          8536.59
 ---------------Time to First Token----------------
-Mean TTFT (ms):                          7585.82
-Median TTFT (ms):                        4689.25
-P99 TTFT (ms):                           30544.70
+Mean TTFT (ms):                          18547.34
+Median TTFT (ms):                        5711.21
+P99 TTFT (ms):                           59776.29
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          80.02
-Median TPOT (ms):                        83.26
-P99 TPOT (ms):                           88.89
+Mean TPOT (ms):                          124.24
+Median TPOT (ms):                        140.70
+P99 TPOT (ms):                           144.12
 ---------------Inter-token Latency----------------
-Mean ITL (ms):                           80.02
-Median ITL (ms):                         50.92
-P99 ITL (ms):                            2263.85
+Mean ITL (ms):                           124.24
+Median ITL (ms):                         71.91
+P99 ITL (ms):                            2290.11
 ----------------End-to-end Latency----------------
-Mean E2EL (ms):                          71521.56
-Median E2EL (ms):                        71237.75
-P99 E2EL (ms):                           97463.28
+Mean E2EL (ms):                          117819.02
+Median E2EL (ms):                        118451.88
+P99 E2EL (ms):                           174508.24
 ==================================================
 ```
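A quick way to sanity-check the server started in Step 3 before launching the benchmark is to send a single request to vLLM's OpenAI-compatible endpoint. This is a minimal illustrative example, assuming the server is running on the default port 8000 and has finished loading the model:

```shell
# Send one chat completion request to the vLLM OpenAI-compatible API
# (default address http://localhost:8000; model name matches the one passed to `vllm serve`).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Briefly explain what vLLM is."}],
        "max_tokens": 128
      }'
```

If the server returns a JSON completion, it is ready to accept the benchmark traffic generated in Step 4.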