
Commit 835192d

hchings authored and dominicshanshan committed
[None][doc] update v1.0 doc for trtllm-serve (NVIDIA#7056)
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent ede27da commit 835192d

File tree

1 file changed: +34, -30 lines changed


docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 34 additions & 30 deletions
@@ -201,56 +201,60 @@ Metrics Endpoint
 
 .. note::
 
-    This endpoint is beta maturity.
+    The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.
 
-    The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
+    Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.
 
-    Some fields, such as CPU memory usage, are not available for the PyTorch backend.
+    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.
 
-    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can impact performance slightly, depending on the serving configuration.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.
 
-The ``/metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
-For the TensorRT backend, these statistics are enabled by default.
-However, for the PyTorch backend, you must explicitly enable iteration statistics logging by setting the `enable_iter_perf_stats` field in a YAML configuration file as shown in the following example:
+For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:
 
 .. code-block:: yaml
 
-   # extra-llm-api-config.yml
-   pytorch_backend_config:
-     enable_iter_perf_stats: true
+   # extra_llm_config.yaml
+   enable_iter_perf_stats: true
 
-Then start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file as shown in the following example:
+Start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file:
 
 .. code-block:: bash
 
-   trtllm-serve <model> \
-     --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
-     [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
+   trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --extra_llm_api_options extra_llm_config.yaml
 
-After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the `/metrics` endpoint:
+After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
+Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.
 
 .. code-block:: bash
 
-   curl -X GET http://<host>:<port>/metrics
+   curl -X GET http://localhost:8000/metrics
 
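
The request-then-poll flow above can also be scripted. Below is a minimal sketch, not part of this commit: it assumes the server started above is listening on ``localhost:8000`` and exposes the OpenAI-compatible ``/v1/completions`` route; the payload fields are illustrative.

.. code-block:: python

   # Illustrative sketch: send one request, then poll /metrics promptly.
   # Assumes trtllm-serve is running on localhost:8000 (see command above).
   import requests

   BASE = "http://localhost:8000"

   # One inference request so the server has iteration statistics to report.
   resp = requests.post(
       f"{BASE}/v1/completions",
       json={
           "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
           "prompt": "Hello, my name is",
           "max_tokens": 16,
       },
   )
   resp.raise_for_status()

   # Statistics are dequeued once read, so poll right away and keep a local copy.
   collected = []
   collected.extend(requests.get(f"{BASE}/metrics").json())
   for entry in collected:
       print(entry.get("iter"), entry.get("iterLatencyMS"))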
-*Example Output*
+Example output:
 
 .. code-block:: json
 
-    [
-      {
-        "gpuMemUsage": 56401920000,
-        "inflightBatchingStats": {
+    [
+        {
+            "gpuMemUsage": 76665782272,
+            "iter": 154,
+            "iterLatencyMS": 7.00688362121582,
+            "kvCacheStats": {
+                "allocNewBlocks": 3126,
+                "allocTotalBlocks": 3126,
+                "cacheHitRate": 0.00128,
+                "freeNumBlocks": 101253,
+                "maxNumBlocks": 101256,
+                "missedBlocks": 3121,
+                "reusedBlocks": 4,
+                "tokensPerBlock": 32,
+                "usedNumBlocks": 3
+            },
+            "numActiveRequests": 1
         ...
-      },
-      "iter": 1,
-      "iterLatencyMS": 16.505143404006958,
-      "kvCacheStats": {
-        ...
-      },
-      "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
-      }
-    ]
+        }
+    ]
+
+
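
The ``kvCacheStats`` block above reduces to a quick utilization figure. A small sketch, not part of this commit, using only the field names shown in the example output:

.. code-block:: python

   # Illustrative: field names come from the sample /metrics entry above.
   def kv_cache_utilization(entry: dict) -> float:
       """Fraction of KV cache blocks currently in use for one iteration."""
       kv = entry["kvCacheStats"]
       return kv["usedNumBlocks"] / kv["maxNumBlocks"]

   sample = {"kvCacheStats": {"usedNumBlocks": 3, "maxNumBlocks": 101256}}
   print(f"{kv_cache_utilization(sample):.6f}")  # ~0.000030 for the sample above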
 
 Syntax
 ------
