
Conversation

lazykyama commented Aug 9, 2025

This PR fixes an issue where perf_analyzer (PA) failed to process chunked responses from a server with HTTP SSE (Server-Sent Events) enabled.

When --session-concurrency and --service-kind openai are specified together for input payloads that include "stream": true, PA fails while parsing a delta response that carries the SSE prefix data:, as shown below. After fixing this parsing issue, a second error appears: what(): std::future_error: Promise already satisfied. It occurs because PA does not properly account for SSE responses combined with the multi-turn payloads built from the chat history.

This PR fixes both problems.
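
As a rough illustration of the two fixes (not the actual perf_analyzer implementation), the sketch below strips the SSE data: prefix before handing the chunk to RapidJSON, and completes the per-request promise only once, when the terminal [DONE] event arrives. HandleSseChunk and its signature are made up for this example.

#include <rapidjson/document.h>

#include <future>
#include <iostream>
#include <string>

// Hypothetical helper: handle one SSE chunk such as
//   data: {"id":"cmpl-...","choices":[{"delta":{...}}],...}
// or the stream terminator
//   data: [DONE]
// Returns true once the final chunk has been seen.
bool HandleSseChunk(const std::string& chunk, std::promise<void>& done,
                    bool& promise_set) {
  static const std::string kPrefix = "data: ";

  // Fix 1: strip the SSE "data: " prefix before JSON parsing,
  // otherwise RapidJSON rejects the whole line (parse error 3, invalid value).
  std::string body = chunk;
  if (body.rfind(kPrefix, 0) == 0) {
    body.erase(0, kPrefix.size());
  }

  // The terminator is not JSON and must not be parsed.
  if (body.rfind("[DONE]", 0) == 0) {
    // Fix 2: satisfy the promise exactly once per request instead of once
    // per delta chunk, which is what triggers
    // "std::future_error: Promise already satisfied".
    if (!promise_set) {
      done.set_value();
      promise_set = true;
    }
    return true;
  }

  rapidjson::Document doc;
  doc.Parse(body.c_str());
  if (doc.HasParseError()) {
    std::cerr << "Failed to parse SSE chunk: " << chunk << std::endl;
  }
  return false;
}

In the real client the completion guard would live alongside the per-request state rather than in a caller-owned flag; the point is only that a streaming request produces many data: chunks but must resolve its future exactly once.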

$ perf_analyzer -m tensorrt_llm_bls --async --stability-percentage 999 --request-count 10 -i http -u ${SERVER_IP}:${SERVER_PORT} --service-kind openai --endpoint v1/chat/completions --input-data artifacts/tensorrt_llm_bls-openai-chat-session_concurrency2/inputs.json --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-session_concurrency2/profile_export.json --session-concurrency 2

 Successfully read data for 1 stream/streams with 32 step/steps.
*** Measurement Settings ***
  Service Kind: OPENAI
  Sending 10 benchmark requests
  Using asynchronous calls for inference

terminate called after throwing an instance of 'std::runtime_error'
  what():  RapidJSON parse error 3. Review JSON for formatting errors:

data: {"id":"cmpl-65fe0899-74b8-11f0-a1f8-89b7d5751a62","choices":[{"delta":{"content":"","function_call":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1754699578,"model":"tensorrt_llm_bls","system_fingerprint":null,"object":"chat.completion.chunk","usage":null}





Aborted (core dumped)

For reference, running the same payload with --concurrency-range instead of --session-concurrency (and with "session_id" removed) produces the output below; no issue occurs.

$ perf_analyzer -m tensorrt_llm_bls --async --stability-percentage 999 --request-count 10 -i http -u ${SERVER_IP}:${SERVER_PORT} --service-kind openai --endpoint v1/chat/completions --input-data artifacts/tensorrt_llm_bls-openai-chat-concurrency2/inputs.json --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-concurrency2/profile_export.json --concurrency-range 2
 Successfully read data for 1 stream/streams with 100 step/steps.
*** Measurement Settings ***
  Service Kind: OPENAI
  Sending 10 benchmark requests
  Using asynchronous calls for inference

Request concurrency: 2
  Client:
    Request count: 10
    Throughput: 0.454482 infer/sec
    Avg latency: 3855839 usec (standard deviation 1410191 usec)
    p50 latency: 2925832 usec
    p90 latency: 8296468 usec
    p95 latency: 8383974 usec
    p99 latency: 8383974 usec
    Avg HTTP time: 4028137 usec (send/recv 4010796 usec + response wait 17341 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 0.454482 infer/sec, latency 3855839 usec

Note that the example input payloads containing "stream": true were generated with genai-perf as shown below.

genai-perf profile \
    --model "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5" \
    --url ${SERVER_IP}:${SERVER_PORT} \
    --endpoint-type chat \
    --streaming \
    --num-sessions 10 \
    --session-concurrency 2 \
    --session-turns-mean 3 \
    --session-turns-stddev 1 \
    --session-turn-delay-mean 1000 \
    --session-turn-delay-stddev 5 \
    --num-prefix-prompts 3 \
    --prefix-prompt-length 30 \
    --synthetic-input-tokens-mean 100 \
    --synthetic-input-tokens-stddev 30 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 30 \
    --tokenizer /hf_home/hub/models--tokyotech-llm--Llama-3.1-Swallow-8B-Instruct-v0.5/snapshots/b1f8317099a97e790ec872c1225ca155979b4816/ \
    --extra-inputs temperature:0.5
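
For context, one entry in the generated inputs.json looks roughly like the following. This is an illustrative reconstruction from the options above and the fields mentioned in this description ("stream": true, "session_id"); the exact field layout may differ from what genai-perf actually emits.

{
  "data": [
    {
      "payload": [
        {
          "model": "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5",
          "messages": [
            { "role": "user", "content": "..." }
          ],
          "stream": true,
          "temperature": 0.5
        }
      ],
      "session_id": "..."
    }
  ]
}

Later turns in the same session carry the accumulated chat history in messages, which is the multi-turn case that exposed the second error above.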
