
Conversation

@LinPoly (Collaborator) commented Aug 26, 2025

Summary by CodeRabbit

  • New Features
    • Added Harmony (GPT-OSS) chat support: OpenAI↔Harmony bridging, streaming, delta reasoning, and delta-based tool-calling.
    • Exposed explicit function-calling workflow for weather tools with concrete function-call examples and outputs.
  • Bug Fixes
    • max_tokens now capped by computed defaults with warnings for invalid configs.
  • Documentation
    • Simplified server startup and added Triton/MoE kernel setup and concrete function-call demos.
  • Tests
    • Added unit and e2e tests covering Harmony chat, reasoning, streaming, and tool calls.
  • Chores
    • Added dependency: openai-harmony==0.0.4.
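
For orientation, here is a minimal client sketch of how the new Harmony chat path might be exercised from the OpenAI Python client. The base URL, API key, and model name are placeholders for a local trtllm-serve deployment, and reasoning_effort is passed through extra_body because named-argument support depends on the installed openai client version.

from openai import OpenAI

# Placeholder endpoint and model name for a local trtllm-serve deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="gpt-oss",
    messages=[{"role": "user", "content": "Briefly explain KV caching."}],
    max_tokens=256,
    # reasoning_effort is part of the extended request schema in this PR; it is
    # sent via extra_body so older openai client versions do not reject it.
    extra_body={"reasoning_effort": "low"},
)

message = response.choices[0].message
# The Harmony path can return a separate reasoning field alongside content.
print("reasoning:", getattr(message, "reasoning", None))
print("content:", message.content)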

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

Signed-off-by: Pengyun Lin <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
coderabbitai bot (Contributor) commented Aug 26, 2025

📝 Walkthrough

Adds Harmony (GPT‑OSS) chat integration with reasoning and tool‑calling, a HarmonyAdapter and stream state, OpenAI protocol extensions, a Harmony chat server path, executor max_tokens cap logic change, example client function‑calling refactor, new dependency, and accompanying tests/config updates.

Changes

Cohort / File(s) Summary
Docs update
examples/models/core/gpt_oss/README.md
Simplifies server invocation and replaces narrative examples with explicit OpenAI function‑calling snippets and a new Triton kernels section.
Example client: native function-calling
examples/models/core/gpt_oss/openai_chat_client_function_calling.py
Replaces prompt/regex parsing with OpenAI function‑calling definitions, maps function names to local callables, parses JSON args, invokes tools, appends tool outputs, and simplifies follow‑up calls.
Dependencies
requirements.txt
Adds openai-harmony==0.0.4.
Executor token cap logic
tensorrt_llm/executor/worker.py
Reworks _deduce_max_tokens: computes default_max_tokens from max_seq_len and uses the user-provided max_tokens only as a cap; warns or errors when max_seq_len is missing or the default is invalid (see the sketch after this table).
Harmony adapter (new module)
tensorrt_llm/serve/harmony_adapter.py
New HarmonyAdapter and HarmonyStreamState: OpenAI↔Harmony token conversion, reasoning channels, tool‑call parsing/filtering, streaming and non‑streaming conversion, stream state utilities, token helpers, and error fallbacks.
OpenAI protocol extensions
tensorrt_llm/serve/openai_protocol.py
Adds DeltaFunctionCall, DeltaToolCall, ReasoningAssistantMessage; makes ChatMessage.content optional and adds reasoning field; delta messages allow nullable tool_calls; adds reasoning_effort and max_tokens alias; expands tool_choice including "auto".
Server integration
tensorrt_llm/serve/openai_server.py
Adds use_harmony switch, initializes HarmonyAdapter, and implements chat_harmony handler that tokenizes to Harmony, appends stop tokens, invokes LLM (streaming/non‑streaming), and uses Harmony streaming/non‑streaming adapters.
Integration test wrapper
tests/integration/defs/test_e2e.py
Adds test_openai_chat_harmony wrapper that runs the new pytest module for Harmony chat tests.
Test list config
tests/integration/test_lists/test-db/l0_h100.yml
Adds Harmony chat test entry to the PyTorch pre_merge test list.
Unit tests: Harmony chat
tests/unittest/llmapi/apps/_test_openai_chat_harmony.py
New async tests exercising reasoning/reasoning_effort, multi‑turn chat, tool calling, streaming, and streaming with tools; includes helper tool descriptor and fixtures.
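
Picking up the executor token-cap row above, a rough sketch of the capping rule it describes; the parameter names are assumptions, not the actual _deduce_max_tokens signature in worker.py.

def deduce_max_tokens(user_max_tokens, prompt_len, max_seq_len):
    """Sketch: the user value only caps a default derived from max_seq_len."""
    if max_seq_len is None:
        # max_seq_len unknown: warn upstream and fall back to the user's value
        return user_max_tokens
    default_max_tokens = max_seq_len - prompt_len
    if default_max_tokens <= 0:
        raise ValueError("prompt already fills max_seq_len; nothing left to generate")
    if user_max_tokens is None:
        return default_max_tokens
    return min(user_max_tokens, default_max_tokens)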

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client as Client (OpenAI API)
  participant Server as OpenAIServer
  participant Adapter as HarmonyAdapter
  participant LLM as LLM Runtime
  Note over Server: use_harmony = true → chat_harmony
  Client->>Server: POST /v1/chat/completions (messages, tools, reasoning_effort, stream?)
  Server->>Adapter: openai_to_harmony_tokens(messages, tools, tool_choice)
  Adapter-->>Server: Harmony tokens (+ stop tokens)
  alt stream == true
    Server->>LLM: generate_stream(tokens, sampling)
    loop token batches
      LLM-->>Server: token batch
      Server->>Adapter: stream_harmony_tokens_to_openai_deltas(batch, tools)
      Adapter-->>Server: OpenAI deltas (reasoning/content/tool_calls)
      Server-->>Client: SSE chunk(s)
    end
    Server->>Adapter: finalize stream state
    Adapter-->>Server: final delta/termination
    Server-->>Client: [DONE]
  else non-streaming
    Server->>LLM: generate(tokens, sampling)
    LLM-->>Server: final tokens
    Server->>Adapter: harmony_output_to_openai(tokens, tools)
    Adapter-->>Server: ChatCompletionResponse
    Server-->>Client: JSON response
  end
sequenceDiagram
  autonumber
  participant User as User
  participant Script as Example Client
  participant OpenAI as Chat API
  participant Tool as Local Python Tool

  User->>Script: Run example
  Script->>OpenAI: chat.create(messages, tools)
  OpenAI-->>Script: assistant(tool_calls, reasoning)
  Script->>Script: parse tool_call (name,args)
  Script->>Tool: invoke(name, **args)
  Tool-->>Script: result (JSON)
  Script->>OpenAI: chat.create(messages + assistant tool_calls + tool result)
  OpenAI-->>Script: final assistant(content)
  Script-->>User: print reasoning + content

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Suggested reviewers

  • laikhtewari
  • hlu1
  • syuoni
  • juney-nvidia


@LinPoly LinPoly self-assigned this Aug 26, 2025
@LinPoly (Collaborator, Author) commented Aug 26, 2025

/bot run

@LinPoly LinPoly requested review from JunyiXu-nv, juney-nvidia and hlu1 and removed request for symphonylyh and nv-guomingz August 26, 2025 12:39
@tensorrt-cicd (Collaborator): PR_Github #16558 [ run ] triggered by Bot

@LinPoly LinPoly requested a review from dongfengy August 26, 2025 12:42
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (1)

63-66: Use typing.List for Python 3.8 compatibility.

list[str] requires Python 3.9+. The repo targets Python 3.8+, so switch to typing.List.

+from typing import List, Dict
@@
-def get_multiple_weathers(locations: list[str],
-                          format: str = "celsius") -> list[dict]:
-    return [get_current_weather(location, format) for location in locations]
+def get_multiple_weathers(locations: List[str],
+                          format: str = "celsius") -> List[dict]:
+    return [get_current_weather(location, format) for location in locations]

Optionally also annotate get_current_weather’s return type as Dict[str, object] for clarity.

tensorrt_llm/serve/openai_server.py (1)

223-256: Health generate should call the Harmony path when use_harmony is enabled.

Currently it always invokes openai_chat, which may not be compatible with GPT‑OSS chat templating.

Proposed change (outside the edited hunk):

# In health_generate()
handler = self.chat_harmony if self.use_harmony else self.openai_chat
response = await handler(health_request, mock_request)
🧹 Nitpick comments (14)
examples/models/core/gpt_oss/README.md (3)

38-39: Clarify server flags required for Harmony path and tool-calling.

The simplified command helps, but Harmony chat completions and tool-calling generally require enabling the Harmony route in the OpenAI server path. Please add the exact flags/env to turn on Harmony for this example (e.g., an --enable_harmony or extra_server_options snippet) so users can reproduce the function-calling flow without guesswork.

I can draft the precise startup command once you confirm the server toggle used in openai_server.py for the Harmony handler.


65-68: Call out that “[COT]” reasoning may include sensitive chain-of-thought and is optional to print.

For public examples, consider noting that printing reasoning is optional and can be disabled if users don’t want to expose COT in logs.

Proposed addition after the sample output:

+Note: The `[COT]` line reflects the model’s reasoning tokens. If you prefer not to print chain-of-thought,
+omit that log line; function calling still works without it.

87-95: Consider showing multi-tool-call handling (N>1) or clarify one-call assumption.

The example prints a single function call. Some models may emit multiple tool_calls. Add a short note that the snippet picks the first tool_call, or expand the sample to iterate through all calls.

- The output would look like:
+ The output would look like (first tool_call shown; if multiple are returned, iterate and resolve each):
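
A small sketch of the iterate-all-calls variant suggested above; TOOL_REGISTRY, resolve_tool_calls, and the weather stub are illustrative names, not part of the example client.

import json


def get_current_weather(location: str, format: str = "celsius") -> dict:
    # Stand-in for the example client's real implementation.
    return {"location": location, "temperature": 20, "unit": format}


TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def resolve_tool_calls(tool_calls, messages):
    """Run every returned tool_call and append one tool message per result."""
    for tool_call in tool_calls or []:
        func = TOOL_REGISTRY[tool_call.function.name]
        kwargs = json.loads(tool_call.function.arguments or "{}")
        messages.append({
            "role": "tool",
            "content": json.dumps(func(**kwargs)),
            "tool_call_id": tool_call.id,
        })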
tensorrt_llm/serve/openai_protocol.py (4)

1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the current-year NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

344-349: DeltaToolCall: minor consistency and docs.

  • Add a one-liner docstring (delta fragment) and consider defaulting type="function" when present.
-class DeltaToolCall(OpenAIBaseModel):
+class DeltaToolCall(OpenAIBaseModel):
+    """Delta fragment of a tool_call in streaming responses."""
     id: Optional[str] = None
-    type: Optional[Literal["function"]] = None
+    type: Optional[Literal["function"]] = None  # when present, must be "function"
     index: int
     function: Optional[DeltaFunctionCall] = None

351-357: Avoid dual fields for reasoning (“reasoning_content” vs “reasoning”).

Both reasoning_content and reasoning exist, which can confuse clients. If “reasoning” is the new GPT‑OSS field, either:

  • deprecate reasoning_content with a clear comment, or
  • alias reasoning to reasoning_content for backward compatibility.

Proposed minimal tweak:

 class ChatMessage(OpenAIBaseModel):
     role: str
     content: Optional[str] = None
-    reasoning_content: Optional[str] = None
-    reasoning: Optional[str] = None
+    reasoning_content: Optional[str] = None  # deprecated; use `reasoning`
+    reasoning: Optional[str] = None

And add a short note in docstrings or API docs to guide users.


495-515: max_completion_tokens alias and tool_choice default behavior: document and align with client expectations.

  • The validation_alias='max_tokens' is spot on. Please add a one-line docstring on the field noting request-side alias support.
  • Defaulting tool_choice to "none" but overriding to "auto" when tools are provided (see validator below) is sensible; document this behavior in the class docstring to avoid surprises.
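
For illustration, a standalone Pydantic v2 sketch of the alias behavior discussed here (not the actual ChatCompletionRequest):

from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class RequestSketch(BaseModel):
    # populate_by_name mirrors OpenAIBaseModel, so the field name is accepted too.
    model_config = ConfigDict(populate_by_name=True)

    max_completion_tokens: Optional[int] = Field(default=None,
                                                 validation_alias="max_tokens")


# Clients may keep sending the legacy "max_tokens" key.
assert RequestSketch.model_validate({"max_tokens": 128}).max_completion_tokens == 128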
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (4)

1-1: Missing NVIDIA copyright header.

Please add the standard header at the top of this Python file.

+#!/usr/bin/env python3
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

99-108: Harden tool_call parsing and JSON argument handling.

arguments may be None or malformed JSON depending on the model. Add a defensive parse with a clear error message.

-    kwargs = json.loads(tool_call.function.arguments)
+    raw_args = tool_call.function.arguments
+    if raw_args is None:
+        raise ValueError(f"Tool call '{func_name}' missing arguments")
+    try:
+        kwargs = json.loads(raw_args)
+    except json.JSONDecodeError as e:
+        raise ValueError(f"Invalid JSON in tool arguments for '{func_name}': {e.msg}") from e

Also consider iterating all tool_calls if multiple are present.


114-122: Ensure assistant/tool messages conform to the extended protocol.

Assistant message without content relies on your ReasoningAssistantMessage extension. That’s fine server-side, but some OpenAI clients are strict. If you encounter client-side validation, pass content="" explicitly.

 messages.extend([{
     "role": "assistant",
+    "content": "",
     "reasoning": reasoning,
     "tool_calls": [tool_call],
 }, {
     "role": "tool",
     "content": json.dumps(answer),
     "tool_call_id": tool_call.id
 }])

6-27: Minor: “format” shadows built-in; consider renaming in examples.

Not critical, but using “unit” or “temp_unit” avoids shadowing Python’s built-in format(). If you prefer keeping the API surface consistent with tests, you can keep the payload field “format” but rename the function parameter internally.

Also applies to: 29-56
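
One way the rename could look while keeping the wire-level "format" key; the names here are illustrative, not a patch against the example client.

import json


def get_current_weather(location: str, temp_unit: str = "celsius") -> dict:
    # Internal name avoids shadowing the built-in format(); the payload keeps "format".
    return {"location": location, "temperature": 20, "format": temp_unit}


def call_weather_tool(arguments: str) -> dict:
    kwargs = json.loads(arguments)
    # Map the wire-level "format" key onto the renamed parameter.
    return get_current_weather(kwargs["location"], temp_unit=kwargs.get("format", "celsius"))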

tests/unittest/llmapi/apps/_test_openai_chat_harmony.py (1)

166-218: Initialize tool_name to avoid UnboundLocalError in streaming tool-call test.

If the stream ends before a name delta arrives, tool_name is unbound. Initialize defensively.

Apply this small tweak:

-    tool_name: str
+    tool_name: str = ""
tensorrt_llm/serve/harmony_adapter.py (2)

281-305: Docstrings and line lengths: tidy to satisfy Ruff D205/D200/E501.

Several docstrings in this module trip D205/D200 and a few lines exceed 120 chars. Not functional, but will keep CI clean.

Would you like me to push a patch that normalizes these docstrings and wraps the long lines flagged by Ruff?


1239-1339: Stateful streaming conversion is solid; minor simplification possible.

Sent tool-call argument deltas keep only the incremental chunk — good. You can drop the sent_tool_arguments boolean map and use a set, since the value is unused.

I can provide a small follow-up patch if you prefer.
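
A sketch of the set-based variant; the class and method names are illustrative, only sent_tool_arguments comes from the adapter.

class HarmonyStreamStateSketch:
    """Illustrative only: membership tracking with a set instead of a bool-valued dict."""

    def __init__(self):
        self.sent_tool_arguments = set()  # tool-call indices whose arguments already streamed

    def mark_sent(self, tool_index: int) -> None:
        self.sent_tool_arguments.add(tool_index)

    def already_sent(self, tool_index: int) -> bool:
        return tool_index in self.sent_tool_arguments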

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 80043af and a27287a.

📒 Files selected for processing (10)
  • examples/models/core/gpt_oss/README.md (3 hunks)
  • examples/models/core/gpt_oss/openai_chat_client_function_calling.py (2 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/serve/harmony_adapter.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (7 hunks)
  • tensorrt_llm/serve/openai_server.py (4 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_chat_harmony.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tests/unittest/llmapi/apps/_test_openai_chat_harmony.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/harmony_adapter.py
  • examples/models/core/gpt_oss/openai_chat_client_function_calling.py
  • tensorrt_llm/serve/openai_protocol.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/executor/worker.py
  • tests/unittest/llmapi/apps/_test_openai_chat_harmony.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/harmony_adapter.py
  • examples/models/core/gpt_oss/openai_chat_client_function_calling.py
  • tensorrt_llm/serve/openai_protocol.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
🧬 Code graph analysis (5)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (4)
  • llm_root (180-181)
  • llm_venv (707-723)
  • test_root (2189-2190)
  • unittest_path (90-91)
tests/unittest/llmapi/apps/_test_openai_chat_harmony.py (2)
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (1)
  • get_current_weather (59-60)
tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1)
  • tool_get_current_weather (56-87)
tensorrt_llm/serve/openai_server.py (1)
tensorrt_llm/serve/harmony_adapter.py (5)
  • HarmonyAdapter (279-1409)
  • handle_non_streaming_response (1489-1543)
  • handle_streaming_response (1412-1486)
  • openai_to_harmony_tokens (538-730)
  • get_stop_tokens (321-327)
tensorrt_llm/serve/harmony_adapter.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (10)
  • ChatCompletionRequest (483-663)
  • ChatCompletionResponse (424-433)
  • ChatCompletionResponseChoice (410-421)
  • ChatCompletionResponseStreamChoice (445-451)
  • ChatCompletionStreamResponse (454-460)
  • ChatMessage (351-356)
  • DeltaFunctionCall (332-334)
  • DeltaMessage (436-442)
  • DeltaToolCall (344-348)
  • UsageInfo (62-65)
tensorrt_llm/executor/result.py (3)
  • request_id (521-522)
  • token_ids_diff (140-141)
  • prompt_token_ids (525-526)
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (2)
tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1)
  • tool_get_current_weather (56-87)
tests/unittest/llmapi/apps/_test_openai_chat_harmony.py (1)
  • get_current_weather (75-76)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_server.py

670-672: 1 blank line required between summary line and description

(D205)

tensorrt_llm/serve/harmony_adapter.py

33-36: 1 blank line required between summary line and description

(D205)


81-83: 1 blank line required between summary line and description

(D205)


270-272: 1 blank line required between summary line and description

(D205)


318-321: 1 blank line required between summary line and description

(D205)


413-415: 1 blank line required between summary line and description

(D205)


541-541: Line too long (123 > 120)

(E501)


557-557: Line too long (126 > 120)

(E501)


729-730: One-line docstring should fit on one line

Reformat to one line

(D200)


742-744: 1 blank line required between summary line and description

(D205)


902-904: 1 blank line required between summary line and description

(D205)


967-967: Line too long (127 > 120)

(E501)


1329-1331: 1 blank line required between summary line and description

(D205)


1347-1349: 1 blank line required between summary line and description

(D205)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
requirements.txt (1)

70-70: Pin OpenAI to ensure compatibility with openai-harmony enums

To avoid runtime/type mismatches (e.g. the new ReasoningEffort enum in openai_protocol.py), please pin a compatible OpenAI version and add a CI check:

• File: requirements.txt (around line 70)
• Replace the unpinned openai entry with a version range. For example:
  -openai
  +openai>=1.52,<2

• In your CI (or locally), install the requirements and verify compatibility:

pip install -r requirements.txt

python - <<'PY'
import importlib.metadata as md
# Show resolved versions
print({pkg: md.version(pkg) for pkg in ["openai","openai-harmony","pydantic"]})

# Verify ReasoningEffort enum values
from openai_harmony import ReasoningEffort
assert {e.name.lower() for e in ReasoningEffort} >= {"low","medium","high"}

# Sanity-check OpenAI types used by our client/server contract
from openai.types.chat import ChatCompletionAssistantMessageParam
print("Compatibility checks passed")
PY

Once you’ve run this in your CI and confirmed the exact versions, we can finalize the pin range.

tensorrt_llm/serve/openai_protocol.py (2)

396-404: ReasoningAssistantMessage: ensure parity with OpenAI client types.

This extends ChatCompletionAssistantMessageParam. Confirm that adding an extra “reasoning” field won’t be rejected by strict clients. If needed, document that the TRT‑LLM server accepts this field even if OpenAI’s client types don’t yet surface it.

I can add a compatibility note in the docs after we confirm whether the OpenAI Python client drops unknown fields on request serialization.


440-443: DeltaMessage: tool_calls should be a list when present; confirm index semantics.

The delta shape now allows Optional[List[DeltaToolCall]]. Ensure the stream assembler handles partial fragments (e.g., function.name first, then arguments chunks) and aggregates by index.

If not already covered, I can add a unit test that feeds interleaved tool_call deltas and validates assembly.
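
A sketch of the aggregation such a test would validate: tool_call fragments arrive as the name first and argument chunks later, and are merged by index. The dict shapes mirror the OpenAI delta layout rather than any specific TRT-LLM helper.

def assemble_tool_calls(deltas):
    """Merge streamed tool_call fragments into complete calls, keyed by index."""
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            call = calls.setdefault(frag["index"], {"id": None, "name": "", "arguments": ""})
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            call["name"] += fn.get("name") or ""
            call["arguments"] += fn.get("arguments") or ""
    return calls


deltas = [
    {"tool_calls": [{"index": 0, "id": "call_0", "function": {"name": "get_current_weather"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"location": '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Paris"}'}}]},
]
assert assemble_tool_calls(deltas)[0]["arguments"] == '{"location": "Paris"}'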

tests/integration/defs/test_e2e.py (1)

1509-1514: Wrapper test for Harmony looks good and consistent with existing pattern.

Uses unittest_path + pytest via llm_venv.run_cmd, matching other app wrappers. No issues spotted.

tensorrt_llm/serve/openai_server.py (1)

181-183: Route selection is OK; consider making it dynamic if model switching is possible at runtime.

If the model (or model_type) can change without restarting the server, the handler decision would go stale. If not, current approach is fine.

Would the same server process ever serve a different model_type during its lifetime?

tests/unittest/llmapi/apps/_test_openai_chat_harmony.py (1)

29-52: Expectations around reasoning tokens might be model-dependent.

Asserting message.reasoning unconditionally can cause flakes if the model suppresses analysis in some runs. Consider asserting it is present or content has non-empty fallback.

Would you like to relax the assertion to accept either reasoning or content presence, or keep it strict for GPT‑OSS?
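
A relaxed check could look roughly like this (sketch; message stands for the chat message returned in the test):

def has_reasoning_or_content(message) -> bool:
    # Relaxed form of the assertion: a run passes if the model produced either
    # a reasoning trace or non-empty content.
    return bool((getattr(message, "reasoning", None) or "").strip()
                or (message.content or "").strip())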

tensorrt_llm/serve/harmony_adapter.py (1)

321-329: Stop tokens helper is a nice encapsulation.

Returning stop tokens for <|return|> and <|call|> centralizes GPT‑OSS behavior.

@LinPoly (Collaborator, Author) commented Aug 26, 2025

/bot kill

@tensorrt-cicd (Collaborator): PR_Github #16559 [ kill ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #16558 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator): PR_Github #16559 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit a27287a

@LinPoly (Collaborator, Author) commented Aug 26, 2025

/bot run

@tensorrt-cicd (Collaborator): PR_Github #16583 [ run ] triggered by Bot

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/serve/openai_protocol.py (1)

1-3: Add NVIDIA copyright header and future annotations import (Python 3.8+).

Per repo guidelines, prepend NVIDIA copyright header to all source files. Also, this module uses PEP 585/604 typing (list[int], A | B); add future annotations to ensure Python 3.8 compatibility.

Apply:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
 # Adapted from
 # https://github.com/vllm-project/vllm/blob/4db5176d9758b720b05460c50ace3c01026eb158/vllm/entrypoints/openai/protocol.py
tensorrt_llm/serve/openai_server.py (1)

1-1: Insert NVIDIA header under the shebang.

Keep the shebang on line 1; place the header immediately after to satisfy tooling and OS expectations.

 #!/usr/bin/env python
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
♻️ Duplicate comments (8)
tensorrt_llm/serve/openai_protocol.py (2)

332-335: Subclass DeltaFunctionCall from OpenAIBaseModel to inherit config (extra="forbid", populate_by_name).

This aligns with the project’s Pydantic base and avoids inconsistent validation.

-class DeltaFunctionCall(BaseModel):
+class DeltaFunctionCall(OpenAIBaseModel):
     name: Optional[str] = None
     arguments: Optional[str] = None

642-646: Treat explicit None like missing for tool_choice default to “auto”.

Currently only the absence of the key flips to "auto". If clients send tool_choice=None with tools present, it stays None and may later behave like "none". Normalize None → "auto".

-        if "tool_choice" not in data and data.get("tools"):
-            data["tool_choice"] = "auto"
+        if data.get("tools") and data.get("tool_choice") is None:
+            # Default to "auto" when tools are provided and no explicit choice is made.
+            data["tool_choice"] = "auto"
tensorrt_llm/serve/openai_server.py (3)

104-107: Guard against missing model_config to avoid AttributeError.

AutoConfig loading can fail; accessing .model_type unguarded will crash server init.

-        # gpt-oss
-        self.harmony_adapter = HarmonyAdapter()
-        self.use_harmony = self.model_config.model_type == "gpt_oss"
+        # gpt-oss
+        self.harmony_adapter = HarmonyAdapter()
+        # Be defensive: model_config can be None
+        self.use_harmony = getattr(self.model_config, "model_type", None) == "gpt_oss"

694-697: Avoid logging user content at error level; reduce PII exposure.

Dumping messages/tools/request at error level can leak sensitive inputs. Keep details at debug with careful scoping.

[security]

-            except Exception as e:
-                logger.error(f"messages_dict: {request.messages}")
-                logger.error(f"tools_dict: {tools_dict}")
-                logger.error(f"request: {request}")
-                raise e
+            except Exception as e:
+                # Avoid dumping potentially sensitive user inputs at error level.
+                logger.debug("Harmony conversion failed; tools only: %s", tools_dict)
+                logger.debug("Request meta: model=%s stream=%s", request.model, request.stream)
+                raise

710-716: Pass disaggregated_params to generate_async for feature parity and metrics.

The OpenAI path passes disaggregated_params; Harmony path should too.

-            # Generate
-            promise = self.llm.generate_async(
+            # Generate
+            disaggregated_params = to_llm_disaggregated_params(request.disaggregated_params)
+            promise = self.llm.generate_async(
                 inputs=harmony_tokens,
                 sampling_params=sampling_params,
                 streaming=bool(request.stream),
                 lora_request=request.lora_request,
+                disaggregated_params=disaggregated_params,
             )
tensorrt_llm/serve/harmony_adapter.py (3)

1-3: Missing NVIDIA header; add per repository policy.

Also keep existing upstream attribution. Place NVIDIA header first.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

4-16: Python 3.8+ compatibility: enable postponed evaluation of annotations.

This file uses PEP 585/604 typing. Add future annotations at the top.

+# NOTE: Python 3.8+ compatibility for PEP 585/604 annotations
+from __future__ import annotations

1433-1440: Streaming: call token_ids_diff() (method), not property.

Without parentheses the adapter streams a bound method object instead of token IDs.

-            responses = harmony_adapter.create_openai_streaming_response(
+            responses = harmony_adapter.create_openai_streaming_response(
                 request_id=request_id,
-                tokens=output.token_ids_diff,
+                tokens=output.token_ids_diff(),
                 available_tools=tools_for_parser,
                 model_name=request.model,
                 tool_choice=tool_choice)
🧹 Nitpick comments (7)
tensorrt_llm/serve/openai_protocol.py (1)

351-356: Clarify “reasoning” vs “reasoning_content” fields; avoid dual field confusion.

Both ChatMessage and DeltaMessage expose reasoning via two names. Consider keeping a single canonical field (“reasoning”) and deprecating the other, or document the intended difference explicitly.

If dual fields are intentional for backward compatibility, add brief docstrings/comments at declaration sites to codify semantics and migration.

Also applies to: 436-443

tensorrt_llm/serve/openai_server.py (3)

56-56: Typo: fix yapf directive.

-# yapf: enale
+# yapf: enable

671-674: Fix docstring per D205: one-line summary, blank line before details.

-        """
-        Chat Completion API with harmony format support.
-        Supports both streaming and non-streaming modes.
-        """
+        """Chat Completion API with Harmony format support.
+
+        Supports both streaming and non-streaming modes.
+        """

225-257: Health generate path uses openai_chat even when use_harmony=True.

Consider calling chat_harmony when Harmony is enabled so the health check exercises the active code path.

Would you like a small patch that dispatches based on self.use_harmony?

tensorrt_llm/serve/harmony_adapter.py (3)

33-37: Docstring style: add a blank line after the summary (D205).

-    """
-    Maintains harmony parsing state for a single request across multiple token batches.
-    This is essential because vLLM provides incremental token batches, but harmony
-    parsing requires continuous state for channel transitions and tool calls.
-    """
+    """Maintains parsing state across token batches.
+
+    vLLM provides incremental token batches, but Harmony parsing requires
+    continuous state for channel transitions and tool calls.
+    """

1034-1040: Nit: fix “HARMONY _OUTPUT” typo for consistency.

-            raw_text = self._safe_decode_utf8(harmony_output_tokens,
-                                              "HARMONY _OUTPUT: ")
+            raw_text = self._safe_decode_utf8(harmony_output_tokens,
+                                              "HARMONY_OUTPUT: ")

317-320: Style: several docstrings violate D205; consider a quick sweep.

Multiple docstrings need a blank line after the summary; a quick ruff --fix will resolve these en masse.

If you want, I can generate a minimal patch for the key public APIs (create_openai_streaming_response, stateful_stream_harmony_tokens_to_openai_deltas, create_stream_state).

Also applies to: 741-743, 901-903, 1326-1328, 1344-1346

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a27287a and 65f6782.

📒 Files selected for processing (4)
  • tensorrt_llm/serve/harmony_adapter.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (7 hunks)
  • tensorrt_llm/serve/openai_server.py (4 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_h100.yml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/harmony_adapter.py
  • tensorrt_llm/serve/openai_server.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/harmony_adapter.py
  • tensorrt_llm/serve/openai_server.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/serve/harmony_adapter.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/serve/harmony_adapter.py
🧬 Code graph analysis (2)
tensorrt_llm/serve/harmony_adapter.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (9)
  • ChatCompletionResponse (424-433)
  • ChatCompletionResponseChoice (410-421)
  • ChatCompletionResponseStreamChoice (445-451)
  • ChatCompletionStreamResponse (454-460)
  • ChatMessage (351-356)
  • DeltaFunctionCall (332-334)
  • DeltaMessage (436-442)
  • DeltaToolCall (344-348)
  • UsageInfo (62-65)
tensorrt_llm/executor/result.py (3)
  • request_id (521-522)
  • token_ids_diff (140-141)
  • prompt_token_ids (525-526)
tensorrt_llm/serve/openai_server.py (3)
tensorrt_llm/serve/harmony_adapter.py (6)
  • HarmonyAdapter (278-1406)
  • handle_non_streaming_response (1486-1540)
  • handle_streaming_response (1409-1483)
  • maybe_transform_reasoning_effort (1587-1598)
  • openai_to_harmony_tokens (537-729)
  • get_stop_tokens (320-326)
tensorrt_llm/serve/openai_protocol.py (3)
  • ChatCompletionRequest (483-665)
  • to_sampling_params (260-302)
  • to_sampling_params (590-631)
tensorrt_llm/llmapi/llm.py (2)
  • tokenizer (691-695)
  • tokenizer (698-699)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/harmony_adapter.py

33-36: 1 blank line required between summary line and description

(D205)


81-83: 1 blank line required between summary line and description

(D205)


269-271: 1 blank line required between summary line and description

(D205)


317-320: 1 blank line required between summary line and description

(D205)


412-414: 1 blank line required between summary line and description

(D205)


540-540: Line too long (123 > 120)

(E501)


556-556: Line too long (126 > 120)

(E501)


728-729: One-line docstring should fit on one line

Reformat to one line

(D200)


741-743: 1 blank line required between summary line and description

(D205)


901-903: 1 blank line required between summary line and description

(D205)


1326-1328: 1 blank line required between summary line and description

(D205)


1344-1346: 1 blank line required between summary line and description

(D205)

tensorrt_llm/serve/openai_server.py

671-673: 1 blank line required between summary line and description

(D205)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/serve/openai_protocol.py (2)

495-518: Alias and default for max tokens look good.

Using validation_alias='max_tokens' and Optional[int] for max_completion_tokens is consistent with OpenAI semantics. No change requested.


14-14: Dependency Confirmed: openai-harmony Declared
The import from openai_harmony import ReasoningEffort is covered by openai-harmony==0.0.4 in requirements.txt (line 70). No further action required.

tensorrt_llm/serve/openai_server.py (1)

182-183: Route selection looks good.

Conditional routing to Harmony path via use_harmony keeps non-Harmony behavior unchanged.

@tensorrt-cicd (Collaborator): PR_Github #16583 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12451 completed with status: 'FAILURE'

@dongfengy (Collaborator) commented:

/bot run

@tensorrt-cicd (Collaborator): PR_Github #16626 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #16626 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12481 completed with status: 'FAILURE'

Signed-off-by: Pengyun Lin <[email protected]>
@LinPoly (Collaborator, Author) commented Aug 27, 2025

/bot run

@tensorrt-cicd (Collaborator): PR_Github #16668 [ run ] triggered by Bot

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/serve/openai_server.py (1)

1-4: Add future annotations for Python 3.8+ compatibility.

PEP 604 unions (e.g., HarmonyAdapter | None) require postponed evaluation on 3.8.

 #!/usr/bin/env python
+from __future__ import annotations
 import asyncio
♻️ Duplicate comments (8)
tensorrt_llm/serve/openai_server.py (3)

104-107: Guard against missing model_config to avoid AttributeError.

self.model_config can be None; accessing .model_type will crash.

-        # gpt-oss
-        self.harmony_adapter: HarmonyAdapter | None = None
-        self.use_harmony = self.model_config.model_type == "gpt_oss"
+        # gpt-oss
+        self.harmony_adapter: HarmonyAdapter | None = None
+        self.use_harmony = getattr(self.model_config, "model_type", None) == "gpt_oss"

690-701: Avoid logging PII-heavy request objects at error level.

Demote to debug and log minimal metadata.

             except Exception as e:
-                logger.error(f"messages_dict: {request.messages}")
-                logger.error(f"tools_dict: {tools_dict}")
-                logger.error(f"request: {request}")
+                logger.debug("Harmony conversion failed. tools=%s model=%s stream=%s",
+                             tools_dict, request.model, request.stream)
                 raise e

710-721: Pass disaggregated_params to generate_async for feature parity.

Align with openai_chat path.

-            sampling_params = request.to_sampling_params(
+            sampling_params = request.to_sampling_params(
                 vocab_size=self.tokenizer.tokenizer.vocab_size)
             sampling_params.detokenize = False  # Harmony adapter handles detokenization
@@
-            # Generate
-            promise = self.llm.generate_async(
+            # Generate
+            disaggregated_params = to_llm_disaggregated_params(request.disaggregated_params)
+            promise = self.llm.generate_async(
                 inputs=harmony_tokens,
                 sampling_params=sampling_params,
                 streaming=bool(request.stream),
                 lora_request=request.lora_request,
+                disaggregated_params=disaggregated_params,
             )
tensorrt_llm/serve/openai_protocol.py (1)

644-646: Default tool_choice to “auto” when tools are provided and tool_choice is None.

Treat explicit None same as missing to match OpenAI behavior.

-        if "tool_choice" not in data and data.get("tools"):
-            data["tool_choice"] = "auto"
+        if data.get("tools") and data.get("tool_choice") is None:
+            data["tool_choice"] = "auto"
tensorrt_llm/serve/harmony_adapter.py (4)

1-3: Add NVIDIA copyright header (required).

Per repo guidelines, prepend the NVIDIA header with current year and Apache-2.0 SPDX.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
+# All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

4-11: Enable postponed annotations for Python 3.8+.

PEP 585/604 types are used; add future import.

+from __future__ import annotations
 import json

1434-1439: Streaming: call token_ids_diff() (method), not attribute.

Without parentheses, you stream a bound method object.

-                tokens=output.token_ids_diff,
+                tokens=output.token_ids_diff(),

1570-1584: Usage accounting: call prompt_token_ids() (method).

Avoid treating it as a property.

-    assert final_res.prompt_token_ids is not None
-    num_prompt_tokens = len(final_res.prompt_token_ids)
+    assert callable(getattr(final_res, "prompt_token_ids", None)), "prompt_token_ids() not available"
+    num_prompt_tokens = len(final_res.prompt_token_ids())
🧹 Nitpick comments (5)
examples/models/core/gpt_oss/README.md (1)

35-39: Remove/clarify “XGrammar enabled” phrasing or show how it’s enabled.

The new invocation trtllm-serve <model> doesn’t indicate any XGrammar flag or YAML. Either drop “with XGrammar enabled” or add the exact CLI/YAML knob that enables structural tags to keep instructions actionable.

-First, launch a server with XGrammar enabled:
+First, launch a server (no extra XGrammar config needed for this example):
tensorrt_llm/serve/openai_server.py (2)

56-56: Typo: “yapf: enale”.

Fix the directive spelling.

-# yapf: enale
+# yapf: enable

671-674: Docstring formatting (D205).

Make summary line followed by a blank line.

-        """
-        Chat Completion API with harmony format support.
-        Supports both streaming and non-streaming modes.
-        """
+        """Chat Completion API with Harmony format support.
+
+        Supports both streaming and non-streaming modes.
+        """
tensorrt_llm/serve/harmony_adapter.py (2)

1415-1466: Optional: emit only semantic deltas (avoid raw harmony tokens in content).

You currently send "<|end|>" markers. Consider suppressing raw channel tokens in SSE for cleaner OpenAI-compatible streams unless tests assert preservation.


317-320: Docstring style (D205/D200).

Several multi-line docstrings lack the blank line after summary or exceed one line constraints. Consider tightening to keep linters green.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 65f6782 and bd43072.

📒 Files selected for processing (5)
  • examples/models/core/gpt_oss/README.md (3 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tensorrt_llm/serve/harmony_adapter.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (7 hunks)
  • tensorrt_llm/serve/openai_server.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/harmony_adapter.py
  • tensorrt_llm/serve/openai_server.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/serve/harmony_adapter.py
  • tensorrt_llm/serve/openai_server.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/serve/harmony_adapter.py
📚 Learning: 2025-08-25T22:42:47.587Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-25T22:42:47.587Z
Learning: Applies to **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} : Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Applied to files:

  • tensorrt_llm/serve/harmony_adapter.py
🧬 Code graph analysis (3)
tensorrt_llm/executor/worker.py (3)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • cp_size (38-39)
tensorrt_llm/logger.py (1)
  • warning (131-132)
tensorrt_llm/executor/result.py (1)
  • prompt_token_ids (525-526)
tensorrt_llm/serve/harmony_adapter.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (10)
  • ChatCompletionRequest (483-665)
  • ChatCompletionResponse (424-433)
  • ChatCompletionResponseChoice (410-421)
  • ChatCompletionResponseStreamChoice (445-451)
  • ChatCompletionStreamResponse (454-460)
  • ChatMessage (351-356)
  • DeltaFunctionCall (332-334)
  • DeltaMessage (436-442)
  • DeltaToolCall (344-348)
  • UsageInfo (62-65)
tensorrt_llm/executor/result.py (3)
  • request_id (521-522)
  • token_ids_diff (140-141)
  • prompt_token_ids (525-526)
tensorrt_llm/serve/openai_server.py (3)
tensorrt_llm/serve/harmony_adapter.py (6)
  • HarmonyAdapter (278-1406)
  • handle_non_streaming_response (1486-1540)
  • handle_streaming_response (1409-1483)
  • maybe_transform_reasoning_effort (1587-1598)
  • openai_to_harmony_tokens (537-729)
  • get_stop_tokens (320-326)
tensorrt_llm/serve/openai_protocol.py (3)
  • ChatCompletionRequest (483-665)
  • to_sampling_params (260-302)
  • to_sampling_params (590-631)
tensorrt_llm/llmapi/llm.py (2)
  • tokenizer (691-695)
  • tokenizer (698-699)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/harmony_adapter.py

33-36: 1 blank line required between summary line and description

(D205)


81-83: 1 blank line required between summary line and description

(D205)


269-271: 1 blank line required between summary line and description

(D205)


317-320: 1 blank line required between summary line and description

(D205)


412-414: 1 blank line required between summary line and description

(D205)


540-540: Line too long (123 > 120)

(E501)


556-556: Line too long (126 > 120)

(E501)


728-729: One-line docstring should fit on one line

Reformat to one line

(D200)


741-743: 1 blank line required between summary line and description

(D205)


901-903: 1 blank line required between summary line and description

(D205)


1326-1328: 1 blank line required between summary line and description

(D205)


1344-1346: 1 blank line required between summary line and description

(D205)

tensorrt_llm/serve/openai_server.py

671-673: 1 blank line required between summary line and description

(D205)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
examples/models/core/gpt_oss/README.md (1)

127-138: Confirm TRITON MoE backend toggle matches current CLI/YAML.

Double-check the backend key (moe_config.backend: TRITON) and CLI flag (--moe_backend TRITON) are the canonical names used by the server; otherwise users will churn.

Would you like me to scan the repo for all places that parse moe_backend to ensure the doc matches?

tensorrt_llm/serve/openai_protocol.py (4)

332-349: Delta models now inherit OpenAIBaseModel — good.

Consistent Pydantic config across models; no issues spotted.


351-357: ChatMessage optional content/reasoning — good.

This accommodates Harmony reasoning and empty chunks.


396-404: ReasoningAssistantMessage addition — good.

Extends message types without breaking existing aliases.


495-517: ReasoningEffort, max_completion_tokens alias — good.

Fields and defaults look correct.

@JunyiXu-nv JunyiXu-nv (Collaborator) left a comment

LGTM.

@tensorrt-cicd (Collaborator): PR_Github #16674 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator): PR_Github #16668 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator): PR_Github #16674 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12516 completed with status: 'FAILURE'

@LinPoly (Collaborator, Author) commented Aug 27, 2025

/bot run

@tensorrt-cicd (Collaborator): PR_Github #16686 [ run ] triggered by Bot

@nv-guomingz nv-guomingz (Collaborator) left a comment

LGTM

@tensorrt-cicd (Collaborator): PR_Github #16686 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12526 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.
