Conversation

@zhang-minchao zhang-minchao (Collaborator) commented Dec 15, 2025

benchmark client cmd:

python3 test_per.py --backend xllm --dataset-name random --random-range-ratio 1 --num-prompt 100  --max-concurrency 2 --random-input 320 --random-output 1 --host 0.0.0.0 --port 17983 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --model /export/home/zhangminchao1/models/Qwen3-0.6B

Normal scheduler and llm_worker, server cmd:

$BIN_PATH \
    --model $MODEL_PATH \
    --host=11.87.189.98 \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$WORLD_SIZE \
    --node_rank=$i \
    --max_memory_utilization=0.85 \
    --max_tokens_per_batch=8192 \
    --max_seqs_per_batch=256 \
    --block_size=128 \
    --communication_backend=hccl \
    --enable_schedule_overlap=false \
    --enable_chunked_prefill=false \
    --enable_prefix_cache=false

result:

============ Serving Benchmark Result ============
Backend:                                 xllm      
Traffic request rate:                    inf       
Max reqeuest concurrency:                2         
Successful requests:                     100       
Benchmark duration (s):                  1.12      
Total input tokens:                      32000     
Total generated tokens:                  100       
Total generated tokens (retokenized):    100       
Avg input tokens:                        320.0     
Avg generated tokens:                    1.0       
Request throughput (req/s):              88.93     
Input token throughput (tok/s):          28456.38  
Output token throughput (tok/s):         88.93     
Total token throughput (tok/s):          28545.31  
Concurrency:                             1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22.00     
Median E2E Latency (ms):                 21.67     
---------------Time to First Token----------------
Mean TTFT (ms):                          21.88     
Median TTFT (ms):                        21.53     
P90 TTFT (ms):                           22.26     
P99 TTFT (ms):                           29.32     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P90 TPOT (ms):                           0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P90 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
==================================================

fixedstep_scheduler and concurrent_llm_worker, server cmd:

$BIN_PATH \
    --model $MODEL_PATH \
    --host=11.87.189.98 \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$WORLD_SIZE \
    --node_rank=$i \
    --max_memory_utilization=0.85 \
    --max_tokens_per_batch=8192 \
    --max_seqs_per_batch=256 \
    --block_size=128 \
    --communication_backend=hccl \
    --enable_schedule_overlap=false \
    --enable_chunked_prefill=false \
    --enable_prefix_cache=false \
    --llm_worker_max_concurrency=2 \
    --enable_fixedsteps_scheduler=true

result:

============ Serving Benchmark Result ============
Backend:                                 xllm      
Traffic request rate:                    inf       
Max reqeuest concurrency:                2         
Successful requests:                     100       
Benchmark duration (s):                  0.73      
Total input tokens:                      32000     
Total generated tokens:                  100       
Total generated tokens (retokenized):    0         
Avg input tokens:                        320.0     
Avg generated tokens:                    1.0       
Request throughput (req/s):              136.65    
Input token throughput (tok/s):          43728.85  
Output token throughput (tok/s):         136.65    
Total token throughput (tok/s):          43865.50  
Concurrency:                             1.96      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14.33     
Median E2E Latency (ms):                 14.15     
---------------Time to First Token----------------
Mean TTFT (ms):                          0.00      
Median TTFT (ms):                        0.00      
P90 TTFT (ms):                           0.00      
P99 TTFT (ms):                           0.00      
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P90 TPOT (ms):                           0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P90 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
==================================================
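
Compared with the baseline run, the fixed-step scheduler with the concurrent llm_worker raises request throughput from 88.93 to 136.65 req/s and lowers mean E2E latency from 22.00 ms to 14.33 ms on this 320-input/1-output-token workload.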

"Number of streams for parallel step execution.");

// --- fixsteps scheduler config ---
DEFINE_bool(use_fixsteps_scheduler,
Collaborator:

What does "fixsteps" mean?

Collaborator Author:

fixed_steps

@zhang-minchao zhang-minchao force-pushed the feat/concurrent_multi_stream_executor branch from f9e2fc0 to b3cd0b5 on December 16, 2025 06:58
impl_ = new LLMWorkerImpl(parallel_args, device, options);
if (fLB::FLAGS_enable_concurrent_llm_worker) {
  impl_ = new ConcurrentLLMWorkerImpl(
      parallel_args, device, options, FLAGS_concurrent_execute_stream_num);
Collaborator:

nit: delete fLB::?

Also, maybe we don't need to pass FLAGS_concurrent_execute_stream_num to the ConcurrentLLMWorkerImpl constructor.

Collaborator Author:

Done.
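
A minimal sketch of what the construction could look like after addressing both nits (the flag and class names are from this PR; the exact constructor signature after the change is an assumption), with ConcurrentLLMWorkerImpl reading the stream-count flag internally rather than taking it as an argument:

    // Sketch only, not the PR's actual code: the concurrent worker reads
    // FLAGS_concurrent_execute_stream_num itself, so it is not passed here,
    // and the fLB:: prefix is dropped.
    if (FLAGS_enable_concurrent_llm_worker) {
      impl_ = new ConcurrentLLMWorkerImpl(parallel_args, device, options);
    } else {
      impl_ = new LLMWorkerImpl(parallel_args, device, options);
    }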

return;
}

// LOG(INFO) << "ContinuousScheduler::step: batch size " << batch.size();
Collaborator:

nit: delete comment.

Collaborator Author:

Done.

"Whether to enable multi stream llm.");

DEFINE_int32(concurrent_execute_stream_num,
2,
Collaborator:

nit: could a single flag be used instead of these two? For example, keep only concurrent_execute_stream_num; when concurrent_execute_stream_num=1 we know to use the default llm worker. There are too many flags now :( . Just a suggestion.

Collaborator Author:

Done.
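
A minimal sketch of the single-flag approach suggested above (the flag name and description come from this PR; the selection logic and default value are assumptions, not necessarily the PR's final implementation):

    // Sketch only: one flag controls both whether the concurrent worker is
    // used and how many execution streams it gets; 1 means the default worker.
    DEFINE_int32(concurrent_execute_stream_num,
                 1,
                 "Number of streams for parallel step execution.");

    if (FLAGS_concurrent_execute_stream_num > 1) {
      impl_ = new ConcurrentLLMWorkerImpl(parallel_args, device, options);
    } else {
      impl_ = new LLMWorkerImpl(parallel_args, device, options);
    }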

@zhang-minchao zhang-minchao force-pushed the feat/concurrent_multi_stream_executor branch 2 times, most recently from 27cf6cc to 87fc7cc on December 16, 2025 09:29
@zhang-minchao zhang-minchao changed the title from "feat: concurrent multi stream executor for rec_model" to "feat: concurrent multi stream executor for rec_model." on Dec 16, 2025