Conversation

@zhang-minchao zhang-minchao (Collaborator) commented Dec 15, 2025

benchmark client cmd:

python3 test_per.py --backend xllm --dataset-name random --random-range-ratio 1 --num-prompt 100  --max-concurrency 2 --random-input 320 --random-output 1 --host 0.0.0.0 --port 17983 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --model /export/home/zhangminchao1/models/Qwen3-0.6B

Normal scheduler and llm_worker, server cmd:

$BIN_PATH \
    --model $MODEL_PATH \
    --host=11.87.189.98 \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$WORLD_SIZE \
    --node_rank=$i \
    --max_memory_utilization=0.85 \
    --max_tokens_per_batch=8192 \
    --max_seqs_per_batch=256 \
    --block_size=128 \
    --communication_backend=hccl \
    --enable_schedule_overlap=false \
    --enable_chunked_prefill=false \
    --enable_prefix_cache=false

result:

============ Serving Benchmark Result ============
Backend:                                 xllm      
Traffic request rate:                    inf       
Max reqeuest concurrency:                2         
Successful requests:                     100       
Benchmark duration (s):                  1.12      
Total input tokens:                      32000     
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Avg input tokens:                        320.0     
Avg generated tokens:                    1.0       
Request throughput (req/s):              88.93     
Input token throughput (tok/s):          28456.38  
Output token throughput (tok/s):         88.93     
Total token throughput (tok/s):          28545.31  
Concurrency:                             1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22.00     
Median E2E Latency (ms):                 21.67     
---------------Time to First Token----------------
Mean TTFT (ms):                          21.88     
Median TTFT (ms):                        21.53     
P90 TTFT (ms):                           22.26     
P99 TTFT (ms):                           29.32     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P90 TPOT (ms):                           0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P90 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
==================================================

fixedstep_scheduler and concurrent_llm_worker, server cmd:

$BIN_PATH \
    --model $MODEL_PATH \
    --host=11.87.189.98 \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$WORLD_SIZE \
    --node_rank=$i \
    --max_memory_utilization=0.85 \
    --max_tokens_per_batch=8192 \
    --max_seqs_per_batch=256 \
    --block_size=128 \
    --communication_backend=hccl \
    --enable_schedule_overlap=false \
    --enable_chunked_prefill=false \
    --enable_prefix_cache=false \
    --llm_worker_max_concurrency=2 \
    --enable_fixedsteps_scheduler=true

result:

============ Serving Benchmark Result ============
Backend:                                 xllm      
Traffic request rate:                    inf       
Max reqeuest concurrency:                2         
Successful requests:                     100       
Benchmark duration (s):                  0.73      
Total input tokens:                      32000     
Total generated tokens:                  100       
Total generated tokens (retokenized):    0         
Avg input tokens:                        320.0     
Avg generated tokens:                    1.0       
Request throughput (req/s):              136.65    
Input token throughput (tok/s):          43728.85  
Output token throughput (tok/s):         136.65    
Total token throughput (tok/s):          43865.50  
Concurrency:                             1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14.33     
Median E2E Latency (ms):                 14.15     
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00      
Median TTFT (ms):                        0.00      
P90 TTFT (ms):                           0.00      
P99 TTFT (ms):                           0.00      
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P90 TPOT (ms):                           0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P90 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
==================================================
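
Compared with the baseline run, the fixed-step scheduler with the concurrent llm_worker raises request throughput from 88.93 to 136.65 req/s and lowers mean E2E latency from 22.00 ms to 14.33 ms on this 320-input/1-output-token workload.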

"Number of streams for parallel step execution.");

// --- fixsteps scheduler config ---
DEFINE_bool(use_fixsteps_scheduler,
Collaborator:

What does "fixsteps" mean?

Collaborator Author:

fixed_steps

@zhang-minchao zhang-minchao force-pushed the feat/concurrent_multi_stream_executor branch from f9e2fc0 to b3cd0b5 on December 16, 2025 06:58
impl_ = new LLMWorkerImpl(parallel_args, device, options);
if (fLB::FLAGS_enable_concurrent_llm_worker) {
  impl_ = new ConcurrentLLMWorkerImpl(
      parallel_args, device, options, FLAGS_concurrent_execute_stream_num);
Collaborator:

nit: delete fLB::?

Also, maybe we don't need to pass FLAGS_concurrent_execute_stream_num to the ConcurrentLLMWorkerImpl constructor.

Collaborator Author:

Done.
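
A minimal sketch of what the construction could look like after addressing both nits (the flag and class names are from this PR; the exact constructor signature after the change is an assumption), with ConcurrentLLMWorkerImpl reading the stream-count flag internally rather than taking it as an argument:

    // Sketch only, not the PR's actual code: the concurrent worker reads
    // FLAGS_concurrent_execute_stream_num itself, so it is not passed here,
    // and the fLB:: prefix is dropped.
    if (FLAGS_enable_concurrent_llm_worker) {
      impl_ = new ConcurrentLLMWorkerImpl(parallel_args, device, options);
    } else {
      impl_ = new LLMWorkerImpl(parallel_args, device, options);
    }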

return;
}

// LOG(INFO) << "ContinuousScheduler::step: batch size " << batch.size();
Collaborator:

nit: delete comment.

Collaborator Author:

Done.

"Whether to enable multi stream llm.");

DEFINE_int32(concurrent_execute_stream_num,
2,
Collaborator:

nit: could a single flag be used instead of these two? For example, keep only concurrent_execute_stream_num; when concurrent_execute_stream_num=1 we know to use the default llm worker. There are too many flags now :( . Just a suggestion.

Collaborator Author:

Done.
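
A minimal sketch of the single-flag approach suggested above (the flag name and description come from this PR; the selection logic and default value are assumptions, not necessarily the PR's final implementation):

    // Sketch only: one flag controls both whether the concurrent worker is
    // used and how many execution streams it gets; 1 means the default worker.
    DEFINE_int32(concurrent_execute_stream_num,
                 1,
                 "Number of streams for parallel step execution.");

    if (FLAGS_concurrent_execute_stream_num > 1) {
      impl_ = new ConcurrentLLMWorkerImpl(parallel_args, device, options);
    } else {
      impl_ = new LLMWorkerImpl(parallel_args, device, options);
    }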

@zhang-minchao zhang-minchao force-pushed the feat/concurrent_multi_stream_executor branch 2 times, most recently from 27cf6cc to 87fc7cc on December 16, 2025 09:29
@zhang-minchao zhang-minchao changed the title from "feat: concurrent multi stream executor for rec_model" to "feat: concurrent multi stream executor for rec_model." on Dec 16, 2025