
Conversation

lazykyama commented Aug 9, 2025

This PR fixes an issue where perf_analyzer (PA) failed to process chunked responses from a server with HTTP SSE (Server-Sent Events) enabled.

When --session-concurrency and --service-kind openai are specified together for input payloads that include "stream": true, PA fails while parsing a delta response that carries the SSE prefix data:, as shown below. After fixing this parsing issue, a second error appears: what(): std::future_error: Promise already satisfied. It occurs because PA does not properly account for SSE responses combined with the multi-turn payloads built from the chat history.

This PR fixes both problems.
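
As a rough illustration of the two fixes (not the actual perf_analyzer implementation), the sketch below strips the SSE data: prefix before handing the chunk to RapidJSON, and completes the per-request promise only once, when the terminal [DONE] event arrives. HandleSseChunk and its signature are made up for this example.

#include <rapidjson/document.h>

#include <future>
#include <iostream>
#include <string>

// Hypothetical helper: handle one SSE chunk such as
//   data: {"id":"cmpl-...","choices":[{"delta":{...}}],...}
// or the stream terminator
//   data: [DONE]
// Returns true once the final chunk has been seen.
bool HandleSseChunk(const std::string& chunk, std::promise<void>& done,
                    bool& promise_set) {
  static const std::string kPrefix = "data: ";

  // Fix 1: strip the SSE "data: " prefix before JSON parsing,
  // otherwise RapidJSON rejects the whole line (parse error 3, invalid value).
  std::string body = chunk;
  if (body.rfind(kPrefix, 0) == 0) {
    body.erase(0, kPrefix.size());
  }

  // The terminator is not JSON and must not be parsed.
  if (body.rfind("[DONE]", 0) == 0) {
    // Fix 2: satisfy the promise exactly once per request instead of once
    // per delta chunk, which is what triggers
    // "std::future_error: Promise already satisfied".
    if (!promise_set) {
      done.set_value();
      promise_set = true;
    }
    return true;
  }

  rapidjson::Document doc;
  doc.Parse(body.c_str());
  if (doc.HasParseError()) {
    std::cerr << "Failed to parse SSE chunk: " << chunk << std::endl;
  }
  return false;
}

In the real client the completion guard would live alongside the per-request state rather than in a caller-owned flag; the point is only that a streaming request produces many data: chunks but must resolve its future exactly once.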

$ perf_analyzer -m tensorrt_llm_bls --async --stability-percentage 999 --request-count 10 -i http -u ${SERVER_IP}:${SERVER_PORT} --service-kind openai --endpoint v1/chat/completions --input-data artifacts/tensorrt_llm_bls-openai-chat-session_concurrency2/inputs.json --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-session_concurrency2/profile_export.json --session-concurrency 2

 Successfully read data for 1 stream/streams with 32 step/steps.
*** Measurement Settings ***
  Service Kind: OPENAI
  Sending 10 benchmark requests
  Using asynchronous calls for inference

terminate called after throwing an instance of 'std::runtime_error'
  what():  RapidJSON parse error 3. Review JSON for formatting errors:

data: {"id":"cmpl-65fe0899-74b8-11f0-a1f8-89b7d5751a62","choices":[{"delta":{"content":"","function_call":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1754699578,"model":"tensorrt_llm_bls","system_fingerprint":null,"object":"chat.completion.chunk","usage":null}





Aborted (core dumped)

For reference, running the same payload with --concurrency-range instead of --session-concurrency (and with "session_id" removed) produces the output below; no issue occurs.

$ perf_analyzer -m tensorrt_llm_bls --async --stability-percentage 999 --request-count 10 -i http -u ${SERVER_IP}:${SERVER_PORT} --service-kind openai --endpoint v1/chat/completions --input-data artifacts/tensorrt_llm_bls-openai-chat-concurrency2/inputs.json --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-concurrency2/profile_export.json --concurrency-range 2
 Successfully read data for 1 stream/streams with 100 step/steps.
*** Measurement Settings ***
  Service Kind: OPENAI
  Sending 10 benchmark requests
  Using asynchronous calls for inference

Request concurrency: 2
  Client:
    Request count: 10
    Throughput: 0.454482 infer/sec
    Avg latency: 3855839 usec (standard deviation 1410191 usec)
    p50 latency: 2925832 usec
    p90 latency: 8296468 usec
    p95 latency: 8383974 usec
    p99 latency: 8383974 usec
    Avg HTTP time: 4028137 usec (send/recv 4010796 usec + response wait 17341 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 0.454482 infer/sec, latency 3855839 usec

Note that the example input payloads containing "stream": true were generated with genai-perf as shown below.

genai-perf profile \
    --model "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5" \
    --url ${SERVER_IP}:${SERVER_PORT} \
    --endpoint-type chat \
    --streaming \
    --num-sessions 10 \
    --session-concurrency 2 \
    --session-turns-mean 3 \
    --session-turns-stddev 1 \
    --session-turn-delay-mean 1000 \
    --session-turn-delay-stddev 5 \
    --num-prefix-prompts 3 \
    --prefix-prompt-length 30 \
    --synthetic-input-tokens-mean 100 \
    --synthetic-input-tokens-stddev 30 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 30 \
    --tokenizer /hf_home/hub/models--tokyotech-llm--Llama-3.1-Swallow-8B-Instruct-v0.5/snapshots/b1f8317099a97e790ec872c1225ca155979b4816/ \
    --extra-inputs temperature:0.5
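
For context, one entry in the generated inputs.json looks roughly like the following. This is an illustrative reconstruction from the options above and the fields mentioned in this description ("stream": true, "session_id"); the exact field layout may differ from what genai-perf actually emits.

{
  "data": [
    {
      "payload": [
        {
          "model": "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5",
          "messages": [
            { "role": "user", "content": "..." }
          ],
          "stream": true,
          "temperature": 0.5
        }
      ],
      "session_id": "..."
    }
  ]
}

Later turns in the same session carry the accumulated chat history in messages, which is the multi-turn case that exposed the second error above.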
