From 90986347d0bc337ccfe285fc27fe8310ecd365c2 Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Wed, 6 Aug 2025 16:57:58 +0800
Subject: [PATCH 1/4] Create README_AMD_GPU.md

---
 DeepSeek/AMD_GPU/README.md | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 DeepSeek/AMD_GPU/README.md

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
new file mode 100644
index 0000000..bfec33b
--- /dev/null
+++ b/DeepSeek/AMD_GPU/README.md
@@ -0,0 +1,45 @@
+## AMD GPU Installation and Benchmarking Guide
+#### Support Matrix
+
+##### GPU TYPE
+MI300X
+##### DATA TYPE
+FP8
+
+#### Step-by-Step Guide
+Please follow the steps below to install and run DeepSeek-R1 models on AMD MI300X GPUs.
+#### Step 1
+Launch the ROCm vLLM Docker container:
+```shell
+docker run -it --rm \
+ --cap-add=SYS_PTRACE \
+ -e SHELL=/bin/bash \
+ --network=host \
+ --security-opt seccomp=unconfined \
+ --device=/dev/kfd \
+ --device=/dev/dri \
+ -v /:/workspace \
+ --group-add video \
+ --ipc=host \
+ --name vllm_DS \
+rocm/vllm-dev:nightly
+```
+#### Step 2
+ Hugging Face login
+```shell
+ huggingface-cli login
+```
+#### Step 3
+##### FP8
+
+Run the vLLM online serving
+Sample command:
+```shell
+SAFETENSORS_FAST_GPU=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8 --max-model-len 32768 --block-size 1 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768 --gpu-memory-utilization 0.95 --trust-remote-code
+```
+#### Step 4
+Open a new terminal, enter the running Docker container, and run the following benchmark script.
+```shell
+docker exec -it vllm_DS /bin/bash
+python3 /vllm-workspace/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+```

From 2e89cae7361ac8ad1aa5ccefde49b816dcd060dd Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 10:04:47 +0800
Subject: [PATCH 2/4] Update README.md

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 76e036e..77235cd 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,9 @@ This repo intends to host community maintained common recipes to run vLLM answer
 ### Qwen
 Qwen
 - [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)
+### AMD GPU Support
+For the user guide, see the AMD_GPU directory within the model directory (e.g., DeepSeek/AMD_GPU).
+
 ## Contributing
 Please feel free to contribute by adding a new recipe or improving an existing one, just send us a PR!
@@ -31,4 +34,4 @@ uv run mkdocs serve
 ```
 
 ## License
-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
\ No newline at end of file
+This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
From cbb075bc97ee792159824303c4eaa36d34375b49 Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 14:31:57 +0800
Subject: [PATCH 3/4] Update README.md

---
 DeepSeek/AMD_GPU/README.md | 45 +++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
index bfec33b..74fc917 100644
--- a/DeepSeek/AMD_GPU/README.md
+++ b/DeepSeek/AMD_GPU/README.md
@@ -22,11 +22,12 @@ docker run -it --rm \
  --group-add video \
  --ipc=host \
  --name vllm_DS \
-rocm/vllm-dev:nightly
+rocm/vllm:latest
 ```
 #### Step 2
  Hugging Face login
 ```shell
+ pip install -U "huggingface_hub[cli]"
  huggingface-cli login
 ```
 #### Step 3
@@ -35,11 +36,49 @@ rocm/vllm-dev:nightly
 ##### FP8
 
 Run the vLLM online serving
 Sample command:
 ```shell
-SAFETENSORS_FAST_GPU=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1 -tp 8 --max-model-len 32768 --block-size 1 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768 --gpu-memory-utilization 0.95 --trust-remote-code
+SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
+vllm serve deepseek-ai/DeepSeek-R1 \
+--tensor-parallel-size 8 \
+--max-model-len 32768 \
+--max-num-seqs 1024 \
+--max-num-batched-tokens 32768 \
+--disable-log-requests \
+--block-size 1 \
+--compilation-config '{"full_cuda_graph":false}' \
+--trust-remote-code
 ```
 #### Step 4
 Open a new terminal, enter the running Docker container, and run the following benchmark script.
 ```shell
 docker exec -it vllm_DS /bin/bash
-python3 /vllm-workspace/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+```
+```shell
+Maximum request concurrency: 128
+100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [04:43<00:00, 1.76it/s]
+============ Serving Benchmark Result ============
+Successful requests:                     500
+Benchmark duration (s):                  283.98
+Total input tokens:                      1597574
+Total generated tokens:                  400000
+Request throughput (req/s):              1.76
+Output token throughput (tok/s):         1408.53
+Total Token throughput (tok/s):          7034.09
+---------------Time to First Token----------------
+Mean TTFT (ms):                          7585.82
+Median TTFT (ms):                        4689.25
+P99 TTFT (ms):                           30544.70
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          80.02
+Median TPOT (ms):                        83.26
+P99 TPOT (ms):                           88.89
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           80.02
+Median ITL (ms):                         50.92
+P99 ITL (ms):                            2263.85
+----------------End-to-end Latency----------------
+Mean E2EL (ms):                          71521.56
+Median E2EL (ms):                        71237.75
+P99 E2EL (ms):                           97463.28
+==================================================
 ```

From b7d325e027bfa796f217af9dc4c020361db1bc4f Mon Sep 17 00:00:00 2001
From: haic0 <149741444+haic0@users.noreply.github.com>
Date: Thu, 7 Aug 2025 17:11:37 +0800
Subject: [PATCH 4/4] Update README.md

---
 DeepSeek/AMD_GPU/README.md | 84 +++++++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 24 deletions(-)

diff --git a/DeepSeek/AMD_GPU/README.md b/DeepSeek/AMD_GPU/README.md
index 74fc917..41fc2cc 100644
--- a/DeepSeek/AMD_GPU/README.md
+++ b/DeepSeek/AMD_GPU/README.md
@@ -8,7 +8,32 @@ FP8
 
 #### Step-by-Step Guide
 Please follow the steps below to install and run DeepSeek-R1 models on AMD MI300X GPUs.
+The model requires 8 x MI300X GPUs.
+
 #### Step 1
+Verify the GPU environment by running rocm-smi:
+```shell
+================================================== ROCm System Management Interface ==================================================
+============================================================ Concise Info ============================================================
+Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf              PwrCap  VRAM%  GPU%
+              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
+=======================================================================================================================================
+0       9     0x74b5,   21947  51.0°C      163.0W    NPS1, SPX, 0        144Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+1       8     0x74b5,   37820  45.0°C      154.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+2       7     0x74b5,   39350  46.0°C      163.0W    NPS1, SPX, 0        142Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+3       6     0x74b5,   24497  53.0°C      172.0W    NPS1, SPX, 0        142Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+4       5     0x74b5,   36258  51.0°C      169.0W    NPS1, SPX, 0        145Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+5       4     0x74b5,   19365  44.0°C      158.0W    NPS1, SPX, 0        148Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+6       3     0x74b5,   16815  53.0°C      167.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+7       2     0x74b5,   34728  46.0°C      165.0W    NPS1, SPX, 0        141Mhz  900Mhz  0%   perf_determinism  750.0W  0%     0%
+=======================================================================================================================================
+```
+Lock the GPU frequency for deterministic performance:
+```shell
+rocm-smi --setperfdeterminism 1900
+```
+
+#### Step 2
 Launch the ROCm vLLM Docker container:
 ```shell
 docker run -it --rm \
@@ -24,8 +49,7 @@ docker run -it --rm \
  --ipc=host \
  --name vllm_DS \
 rocm/vllm:latest
 ```
-#### Step 2
- Hugging Face login
+Hugging Face login
 ```shell
  pip install -U "huggingface_hub[cli]"
  huggingface-cli login
 ```
 #### Step 3
@@ -36,10 +60,10 @@
 ##### FP8
 
 Run the vLLM online serving
 Sample command:
 ```shell
-SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
+NCCL_MIN_NCHANNELS=112 SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
 vllm serve deepseek-ai/DeepSeek-R1 \
 --tensor-parallel-size 8 \
---max-model-len 32768 \
+--max-model-len 65536 \
 --max-num-seqs 1024 \
 --max-num-batched-tokens 32768 \
 --disable-log-requests \
 --block-size 1 \
 --compilation-config '{"full_cuda_graph":false}' \
 --trust-remote-code
 ```
+
+##### Tips: Users may adjust the following parameters as needed.
+--max-model-len=65536: A good sweet spot in most cases; it conserves memory while still allowing long contexts.
+
+--max-num-batched-tokens=32768: Balances throughput against memory use and latency.
+
+If OOM errors or sluggish performance occur, decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k). For low-latency needs, consider reducing max-num-batched-tokens. To maximize throughput when spare VRAM is available, keep it high, but stay aware of the latency trade-off.
+
+--max-num-seqs=1024: Affects the throughput/latency trade-off. Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency; lower values reduce the GPU memory footprint and latency at the cost of throughput.
+
+
 #### Step 4
-Open a new terminal, enter the running Docker container, and run the following benchmark script.
+Open a new terminal, access the running Docker container, and execute the online serving benchmark script as follows:
+
 ```shell
 docker exec -it vllm_DS /bin/bash
-python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 128 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
+python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 256 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
 ```
 ```shell
-Maximum request concurrency: 128
-100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [04:43<00:00, 1.76it/s]
+Maximum request concurrency: 256
+100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [03:54<00:00, 2.14it/s]
 ============ Serving Benchmark Result ============
 Successful requests:                     500
-Benchmark duration (s):                  283.98
+Benchmark duration (s):                  234.00
 Total input tokens:                      1597574
 Total generated tokens:                  400000
-Request throughput (req/s):              1.76
-Output token throughput (tok/s):         1408.53
-Total Token throughput (tok/s):          7034.09
+Request throughput (req/s):              2.14
+Output token throughput (tok/s):         1709.39
+Total Token throughput (tok/s):          8536.59
 ---------------Time to First Token----------------
-Mean TTFT (ms):                          7585.82
-Median TTFT (ms):                        4689.25
-P99 TTFT (ms):                           30544.70
+Mean TTFT (ms):                          18547.34
+Median TTFT (ms):                        5711.21
+P99 TTFT (ms):                           59776.29
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms):                          80.02
-Median TPOT (ms):                        83.26
-P99 TPOT (ms):                           88.89
+Mean TPOT (ms):                          124.24
+Median TPOT (ms):                        140.70
+P99 TPOT (ms):                           144.12
 ---------------Inter-token Latency----------------
-Mean ITL (ms):                           80.02
-Median ITL (ms):                         50.92
-P99 ITL (ms):                            2263.85
+Mean ITL (ms):                           124.24
+Median ITL (ms):                         71.91
+P99 ITL (ms):                            2290.11
 ----------------End-to-end Latency----------------
-Mean E2EL (ms):                          71521.56
-Median E2EL (ms):                        71237.75
-P99 E2EL (ms):                           97463.28
+Mean E2EL (ms):                          117819.02
+Median E2EL (ms):                        118451.88
+P99 E2EL (ms):                           174508.24
 ==================================================
 ```
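A quick way to sanity-check the server started in Step 3 before launching the benchmark is to send a single request to vLLM's OpenAI-compatible endpoint. This is a minimal illustrative example, assuming the server is running on the default port 8000 and has finished loading the model:

```shell
# Send one chat completion request to the vLLM OpenAI-compatible API
# (default address http://localhost:8000; model name matches the one passed to `vllm serve`).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Briefly explain what vLLM is."}],
        "max_tokens": 128
      }'
```

If the server returns a JSON completion, it is ready to accept the benchmark traffic generated in Step 4.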