[None][feat] add gpt-osss tests to sanity list #7158
📝 Walkthrough: Reorganized and expanded the LLM sanity test manifest.
Sequence Diagram(s): omitted (changes are manifest/test-list edits without control-flow feature changes).
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Force-pushed from 25f3ed0 to b21ad00 (Compare)
/bot run
PR_Github #16161 [ run ] triggered by Bot
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/integration/test_lists/qa/llm_function_sanity.txt (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/test_lists/qa/llm_function_sanity.txt (4)
154-159: Good move: covering disaggregated UCX and TRTLLM sampler in sanity. Adding `test_disaggregated_deepseek_v3_lite_fp8_ucx` and `test_disaggregated_trtllm_sampler` improves early signal on transport and sampler regressions.
137-143: Naming consistency confirmed for W4A16 and W4A8 Qwen tests. I've checked the `test_w4a16_mxfp4` and `test_w4a8_mxfp4` definitions in `tests/integration/defs/accuracy/test_llm_api_pytorch.py` and their `ids=` decorators:
- `test_w4a16_mxfp4` is parameterized with `ids=["latency-TRTLLM"]`, matching the `[latency-TRTLLM]` entry in the sanity list (line 137).
- In `TestQwen3_30B_A3B`, `test_w4a8_mxfp4` uses a two-stage parametrization:
  - Backends: `["CUTLASS", "TRITON", "TRTLLM"]` for the static-fp8 path and `["CUTLASS", "TRTLLM"]` for the MXFP8 path, producing IDs like `"latency-CUTLASS"`, `"latency-TRITON"`, and `"latency-TRTLLM"`.
  - Activation types: `["static_fp8", "mxfp8"]` with `ids=["fp8", "mxfp8"]`.

These combine to exactly the five entries shown:
- `[fp8-latency-CUTLASS]`
- `[fp8-latency-TRITON]`
- `[fp8-latency-TRTLLM]`
- `[mxfp8-latency-CUTLASS]`
- `[mxfp8-latency-TRTLLM]`

(Lines 138-142 in the sanity list.) Everything lines up with the fixture expectations: no mismatches in "mxfp8" vs "fp8" or backend casing. No further action needed.
181-182: Multimodal Gemma quickstarts added to sanity; confirm asset availability/skip markers. Our inspection of `tests/integration/defs/test_e2e.py` shows that the `gemma-3-27b-it` parameters for `test_ptp_quickstart_multimodal` do not have any skip/xfail guards for missing model artifacts (we only see memory-based skips such as `skip_less_device_memory(80000)`).

Please verify one of the following:
- Your CI environment guarantees that the `gemma-3-27b-it` artifacts live at `$LLM_MODELS_ROOT/gemma/gemma-3-27b-it` (so the quickstart can pull them at runtime), OR
- Add a skip marker around the `pytest.param("gemma-3-27b-it", …)` entry in `test_ptp_quickstart_multimodal`, for example:

```diff
 @pytest.mark.parametrize("model_name,model_path", [
     # …
-    ("gemma-3-27b-it", "gemma/gemma-3-27b-it"),
+    pytest.param(
+        "gemma-3-27b-it",
+        "gemma/gemma-3-27b-it",
+        marks=pytest.mark.skipif(
+            not os.path.isdir(os.path.join(llm_models_root(), "gemma/gemma-3-27b-it")),
+            reason="Gemma-3-27B-IT artifacts unavailable",
+        ),
+    ),
     # …
 ])
 def test_ptp_quickstart_multimodal(llm_root, llm_venv, …):
```

This will ensure the sanity suite only runs the Gemma multimodal tests when those assets are present.
48-61: Verified: GPT-OSS test parametrizations match the fixtures. I inspected tests/integration/defs/accuracy/test_llm_api_pytorch.py and confirmed the added GPT-OSS tests use the expected parameter names and enum values (no case mismatches).
Key locations:
- tests/integration/defs/accuracy/test_llm_api_pytorch.py — class TestGPTOSS (around line 2680).
- test_w4_1gpu parametrization: lines ~2692–2697 (moe_backend=["CUTLASS","TRTLLM","TRITON"], ids=["cutlass","trtllm","triton"]; cuda_graph/overlap_scheduler set to (True, True)).
- test_w4_4gpus parametrization: lines ~2720–2726 (same moe_backend values; tp/pp/ep combos with ids like "tp4", "ep4", "dp4").
- test_w4a16 signature/params: lines ~2753–2759 (presence confirmed).
Conclusion: enums and parameter names are correct as implemented; no changes required before merge.
Force-pushed from b21ad00 to 98efcf7 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
1-201: Blocking: all entries appear to reference files outside the repo-visible paths (need prefix correction). Previous verification already showed unresolved paths for every entry. Unless the repo's test discovery intentionally runs from tests/integration/, you likely need to prefix most paths:
- accuracy/... → tests/integration/accuracy/...
- disaggregated/... → tests/integration/disaggregated/...
- test_e2e.py → tests/integration/test_e2e.py
Fix paths first; otherwise, the sanity stage will schedule 0 runnable tests or fail fast during collection.
Run this to auto-rewrite and re-verify:
```bash
#!/bin/bash
set -euo pipefail
f=tests/integration/test_lists/qa/llm_function_sanity.txt

# 1) Normalize paths
tmp=$(mktemp)
sed -E \
  -e 's@^accuracy/@tests/integration/accuracy/@' \
  -e 's@^disaggregated/@tests/integration/disaggregated/@' \
  -e 's@^test_e2e\.py@tests/integration/test_e2e.py@' \
  "$f" > "$tmp" && mv "$tmp" "$f"

# 2) Check file/class/test existence (ignoring param brackets)
missing=0
while IFS= read -r line || [[ -n "$line" ]]; do
  [[ -z "$line" || "$line" =~ ^# ]] && continue
  path="${line%%::*}"
  rest="${line#*::}"
  class="${rest%%::*}"
  test_with_params="${rest#*::}"
  test="${test_with_params%%[*}"
  if [[ ! -f "$path" ]]; then
    echo "MISSING FILE: $path <-- $line"
    missing=1; continue
  fi
  if ! rg -nP "^\s*class\s+${class}\b" "$path" >/dev/null; then
    echo "MISSING CLASS: $class in $path <-- $line"
    missing=1
  fi
  if ! rg -nP "^\s*def\s+${test}\b" "$path" >/dev/null; then
    echo "MISSING TEST: $test in $path <-- $line"
    missing=1
  fi
done < "$f"
if [[ $missing -eq 1 ]]; then
  echo "One or more entries are unresolved." >&2
  exit 1
else
  echo "All sanity entries resolve to existing classes/tests."
fi
```
88-91: Remove duplicated manifest entries to avoid double scheduling. These lines are exact duplicates of neighbors and will run twice:
- Lines 88-91: TestLlama4MaverickInstruct::test_chunked_prefill for FLASHINFER and TRTLLM
- Lines 114-117: TestMinistral8BInstruct::test_auto_dtype and ::test_fp8
- Lines 146-147: disaggregated_cache_aware_balance for TinyLlama

Delete the second occurrence of each. Apply:

```diff
--- a/tests/integration/test_lists/qa/llm_function_sanity.txt
+++ b/tests/integration/test_lists/qa/llm_function_sanity.txt
@@ -88,6 +88,4 @@
 accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
 accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=TRTLLM]
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=TRTLLM]
@@ -114,4 +112,2 @@
 accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_auto_dtype
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_auto_dtype
 accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8
@@ -146,2 +142,1 @@
 disaggregated/test_disaggregated.py::test_disaggregated_cache_aware_balance[TinyLlama-1.1B-Chat-v1.0]
-disaggregated/test_disaggregated.py::test_disaggregated_cache_aware_balance[TinyLlama-1.1B-Chat-v1.0]
```

To scan for any stragglers:

```bash
awk '{c[$0]++; if(c[$0]==2) print "DUP", NR, $0}' tests/integration/test_lists/qa/llm_function_sanity.txt
```

Also applies to: 114-117, 146-147
🧹 Nitpick comments (10)
tests/integration/test_lists/qa/llm_function_sanity.txt (10)
18-26: Llama3.1 8B disaggregated: heavy tp/pp/eagle/ngram matrix may not be "sanity". This block explodes combinations across GSM8K/MMLU and tp/pp/eagle/ngram. Consider sampling 1-2 reps and moving the rest to extended suites to keep sanity time bounded.
I can propose a minimal subset that still exercises each axis (tp, pp, eagle, ngram) with ≤4 lines.
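To make the trimming concrete, here is a hypothetical helper illustrating the "one entry per axis" idea: it keeps the first manifest entry matching each axis keyword and passes unrecognized entries through unchanged. The axis keywords are naive substring matches, and the sample entries are invented, not taken from the real manifest.

```python
def one_per_axis(entries, axes):
    """Keep the first entry matching each axis keyword; drop later repeats.

    Entries that match no axis keyword are kept as-is.
    """
    kept, seen = [], set()
    for entry in entries:
        axis = next((a for a in axes if a in entry), None)
        if axis is None:
            kept.append(entry)  # unrecognized entries pass through
        elif axis not in seen:
            kept.append(entry)
            seen.add(axis)
    return kept


sample = [
    "test_disagg[gsm8k-tp2pp2]",
    "test_disagg[mmlu-tp2pp2]",
    "test_disagg[gsm8k-eagle3]",
    "test_disagg[gsm8k-ngram]",
    "test_disagg[mmlu-ngram]",
]
print(one_per_axis(sample, ["tp", "eagle", "ngram"]))
```

Each axis (tp, eagle, ngram) survives with one representative, which is the shape of the ≤4-line subset suggested above.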
40-42: DeepSeekV3Lite BF16/FP8/NVFP4 variants: multi-GPU and block-scale paths are expensive. These add 4-GPU and block-scale quant configs. If the sanity budget is tight, prefer single-GPU BF16 plus one NVFP4 variant, and leave 4-GPU/block-scale configs for throughput/accuracy stages.
Please confirm whether the sanity stage maps to 4-GPU nodes; if not, these will fail or be silently skipped by capacity.
Also applies to: 45-46
48-61: New GPT-OSS W4/W4A16 grid: strong coverage, but trim for sanity. Nine 4-GPU grid points (dp/ep/tp × CUTLASS/TRITON/TRTLLM) plus single-GPU tri-backend entries will balloon runtime. Keep a representative backend per parallelism mode in sanity.
If you want, I’ll provide a reduced set: {1×W4 single-GPU, 1×tp4, 1×ep4, 1×dp4} with alternating backends.
78-79: eagle3_tp8 on 70B: extremely heavy for a sanity lane. tp8 on 70B typically requires large nodes and long warmup. Recommend moving it to an extended/throughput lane; keep a tp4 smoke test if needed.
85-101: Llama4MaverickInstruct TP8/EP8 grids: likely to exceed sanity SLAs. Multiple tp8/ep8 combinations plus fp8 and chunked-prefill: great coverage, but risky for the short sanity lane. Suggest:
- Keep one tp4 and one tp8 smoke (no ep).
- Keep exactly one fp8 case.
- Move the rest to accuracy/throughput.
Confirm target runtime budget for sanity on your CI (minutes per PR).
105-113: Llama4ScoutInstruct: validate FP4 and FP8 combos; ensure artifacts exist.
- test_fp4_chunked_prefill and test_fp4 appear; confirm both are supported in Scout’s test module and artifacts are available for tp8ep8.
- Similar trimming advice as above if sanity runtime is a concern.
120-120: MistralSmall24B added to sanity: sanity budget check. 24B auto-dtype is fine, but if nodes are constrained, consider keeping the 7B/8B families in sanity and moving 24B up-stack.
124-124: Phi4MiniInstruct: ensure artifacts are in cache for the sanity pool. Phi variants sometimes fetch larger weights; cache misses will spike runtime. Warm the cache or swap in a tiny variant for sanity.
181-182: Multimodal Gemma3 27B additions: verify data assets and CI node types. These tests often need image/video sample assets and sometimes larger GPU memory. Confirm assets are packaged and nodes meet requirements in the sanity pool.
1-201: Right-size the sanity manifest; propose a split into sanity vs extended sanity. Given the volume of multi-GPU (tp8/ep8), block-scale, overlap-scheduler, and disaggregated transports, consider:
- Keep 1–2 per family/back-end axis in sanity.
- Move the rest into a new tests/integration/test_lists/qa/llm_function_extended.txt and add a CI stage (or reuse existing “accuracy/throughput”) triggered via /bot run --stage-list.
This keeps PR turnaround fast while preserving coverage elsewhere.
If you want, I’ll draft a trimmed sanity list plus an extended list derived from this manifest.
🔇 Additional comments (6)
tests/integration/test_lists/qa/llm_function_sanity.txt (6)
2-2: DeepSeekV3Lite: verify scheduler/backends are supported in the sanity environment.
- overlap_scheduler=True entries can shift runtime; ensure the runner nodes enable and exercise overlap logic.
- test_nixl_backend requires NIXL plumbing; gate or skip if the sanity pool lacks NIXL.
Would you like a list-level marker or pytest skipif to avoid NIXL on non-supported agents?
Also applies to: 4-5
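A minimal sketch of the skipif gating idea; the environment variable name `TLLM_NIXL_AVAILABLE` and the test body are assumptions for illustration, since a real setup would probe the NIXL runtime directly:

```python
import os

import pytest

# Hypothetical capability flag; substitute a real runtime probe in practice.
requires_nixl = pytest.mark.skipif(
    os.environ.get("TLLM_NIXL_AVAILABLE") != "1",
    reason="NIXL runtime not available on this agent",
)


@requires_nixl
def test_nixl_backend_smoke():
    # Placeholder body; the real test lives in the disaggregated suite.
    assert True
```

A reusable marker like this keeps the gating in one place instead of repeating the condition on every NIXL-dependent test.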
68-68: Confirm presence of the LLM sampler test. Ensure TestLlama3_1_8BInstruct::test_fp8_llm_sampler exists and isn't an artifact of a renamed test (common: test_fp8_sampler or test_llm_sampler_fp8).
I can extend the verification script to assert exact method names for these specific classes only.
114-117: Ministral vs Mistral: verify the class naming is intentional. If the class is actually TestMistral8BInstruct, these entries will not resolve. If "Ministral" is a deliberate internal model, ignore.
I can auto-check for both spellings and report which exists.
137-137: Qwen3_30B_A3B "w4a8_mxfp4" entries: param mismatch suspicion. The function name suggests "mxfp4" but the bracketed params include mxfp8/fp8:
- test_w4a16_mxfp4[latency-TRTLLM] is fine.
- test_w4a8_mxfp4[mxfp8-…] seems contradictory.
If the bracket arg is a separate dimension (e.g., kv dtype), it’s OK. Otherwise, this may select the wrong config.
I can grep the test’s param ids to confirm the intended tuples.
Also applies to: 142-142
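To confirm the intended tuples without running full collection, a rough regex scan of the decorator lines above a test definition can surface its `ids=[...]` lists. This helper is an approximation I am proposing, not existing tooling; a `pytest --collect-only -q` run remains the authoritative check.

```python
import re


def param_ids(source, test_name, window=20):
    """Collect ids=[...] lists from the decorator lines above a test def."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if re.match(rf"\s*def\s+{test_name}\b", line):
            decorators = "\n".join(lines[max(0, i - window):i])
            return re.findall(r"ids=\[([^\]]*)\]", decorators)
    return []


snippet = '''
@pytest.mark.parametrize("dtype", ["static_fp8", "mxfp8"], ids=["fp8", "mxfp8"])
def test_w4a8_mxfp4(dtype):
    pass
'''
print(param_ids(snippet, "test_w4a8_mxfp4"))
```

Feeding it the real module's source for `test_w4a8_mxfp4` would show whether `mxfp8` is a deliberate id on a separate parameter dimension.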
154-154: UCX/NIXL disaggregated variants: gate by environment. test_disaggregated_deepseek_v3_lite_fp8_ucx and _nixl should be conditionally included only when the agent supports these transports; otherwise they'll fail/skip in sanity.
I can add a tiny wrapper list or mark to toggle these via /bot run --stage-list or a pytest skipif.
159-159: New TRT-LLM sampler test: confirm stability in sanity. Samplers can be flaky with aggressive settings; ensure the test is already in a passing state to avoid PR noise.
Force-pushed from 98efcf7 to e1dbb54 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/integration/test_lists/qa/llm_function_sanity.txt (4)
1-6: Confirm path resolution for all manifest entries before merge. Past validation indicated these relative paths did not resolve to real files/classes/methods. Please re-run a resolver and fix any misses (e.g., a missing tests/integration/ prefix or moved files) before merging.

Run this script from the repo root to verify file/class/test existence and catch whitespace/dup issues:
```bash
#!/bin/bash
set -euo pipefail
file="tests/integration/test_lists/qa/llm_function_sanity.txt"
missing=0
while IFS= read -r line || [[ -n "$line" ]]; do
  [[ -z "$line" || "$line" =~ ^# ]] && continue
  line="${line%%[[:space:]]}"  # rstrip
  path="${line%%::*}"
  rest="${line#*::}"
  class="${rest%%::*}"
  test_with_params="${rest#*::}"
  test="${test_with_params%%[*}"
  if [[ ! -f "$path" ]]; then
    echo "MISSING FILE: $path <-- $line"
    missing=1; continue
  fi
  if ! rg -nP "^\s*class\s+${class}\b" "$path" >/dev/null; then
    echo "MISSING CLASS: $class in $path <-- $line"
    missing=1
  fi
  if ! rg -nP "^\s*def\s+${test}\b" "$path" >/dev/null; then
    echo "MISSING TEST: $test in $path <-- $line"
    missing=1
  fi
done < "$file"
if [[ $missing -eq 1 ]]; then
  echo "✖ One or more entries unresolved." >&2; exit 1
else
  echo "✓ All entries resolve."
fi
```
85-101: Remove duplicated Llama4Maverick chunked_prefill entries. These two lines duplicate existing entries and will double-run:
- Line 88 duplicates Line 89 (FLASHINFER)
- Line 91 duplicates Line 90 (TRTLLM)

Apply this minimal diff:

```diff
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=TRTLLM]
```

After editing, run:

```bash
awk '{cnt[$0]++; if (cnt[$0]==2) printf("DUP:%6d %s\n", NR, $0)}' tests/integration/test_lists/qa/llm_function_sanity.txt | sort
```
114-117: Drop duplicate Ministral8BInstruct entries.
- Line 114 duplicates Line 115 (auto_dtype)
- Line 117 duplicates Line 116 (fp8)

Apply:

```diff
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_auto_dtype
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8
```

Also, confirm the class name spelling "Ministral" is intentional (there are also "Mistral" classes in this file).
146-147: Remove duplicate disaggregated_cache_aware_balance entry. Line 146 duplicates Line 147. Remove one occurrence to avoid a double run.

```diff
-disaggregated/test_disaggregated.py::test_disaggregated_cache_aware_balance[TinyLlama-1.1B-Chat-v1.0]
```
🧹 Nitpick comments (7)
tests/integration/test_lists/qa/llm_function_sanity.txt (7)
18-26: Large new Llama3.1-8B disaggregated matrix; consider trimming for the sanity stage. EAGLE3, NGRAM, and multiple tp/pp symmetric variants can materially increase runtime. For sanity, prefer 1-2 representative configs and move the rest to extended/accuracy suites.
78-79: EAGLE3 tp8 for 70B will extend runtime considerably. For the sanity stage, consider deferring tp8 variants to throughput/extended. Keep a single tp4 representative here.
85-101: Many new Maverick FP8/auto-dtype variants; trim for sanity. Consider keeping a minimal subset (e.g., one tp4 no-cudagraph, one tp8 with cudagraph, one fp8 eagle3) and move the grid to accuracy/throughput.
105-113: Scout FP4/FP8 grid likely heavy for sanity. Same suggestion: keep 1-2 representatives (e.g., fp4 tp4ep4 with cudagraph, fp8 tp8ep8) and relocate the remainder.
128-128: Qwen3 235B NVFP4 4-GPU eagle3/moe: too heavy for sanity. Recommend moving it to extended/perf; keep a lighter NVFP4 single-node variant in sanity.
181-182: New multimodal Gemma e2e entries; check asset availability and runtime. Confirm the image/video fixtures are small and bundled/cached; otherwise these can bloat the sanity stage. If large, move to multimodal-quickstart or extended sanity.
1-201: Automate duplicate and resolution checks in CI to prevent regressions. Consider adding a lightweight presubmit that fails on unresolved entries and duplicate lines for all test-list manifests.

Here's a simple guard you can drop under a repo script and call from CI:

```bash
#!/bin/bash
set -euo pipefail
rc=0
for f in tests/integration/test_lists/**/*.txt; do
  echo "Checking $f"
  # duplicates
  if awk '{cnt[$0]++} END{for (l in cnt) if (cnt[l]>1 && l !~ /^#/ && l !~ /^[[:space:]]*$/) {print "DUP: " cnt[l] "x " l; exit 1}}' "$f"; then :; else rc=1; fi
  # resolution
  while IFS= read -r line || [[ -n "$line" ]]; do
    [[ -z "$line" || "$line" =~ ^# ]] && continue
    path="${line%%::*}"; rest="${line#*::}"; cls="${rest%%::*}"; test="${rest#*::}"; test="${test%%[*}"
    [[ -f "$path" ]] || { echo "MISSING FILE in $f: $path"; rc=1; continue; }
    rg -nP "^\s*class\s+${cls}\b" "$path" >/dev/null || { echo "MISSING CLASS in $f: $cls @ $path"; rc=1; }
    rg -nP "^\s*def\s+${test}\b" "$path" >/dev/null || { echo "MISSING TEST in $f: $test @ $path"; rc=1; }
  done < "$f"
done
exit $rc
```
🔇 Additional comments (9)
tests/integration/test_lists/qa/llm_function_sanity.txt (9)
30-30: nixl backend addition: confirm infra readiness. Make sure the CI agents have the nixl runtime and required env vars; otherwise these will be skipped or flake.
40-46: DeepSeekV3Lite additions span FP8/NVFP4/bfloat16 with various toggles; validate param names. Please verify all parametrization keys (attention_dp, cuda_graph, torch_compile, overlap_scheduler, fp8kv, moe_backend, mtp_nextn/mtp) match the pytest parameter names to avoid silent deselection.
48-60: Nice: GPT-OSS coverage added across 1-GPU and 4-GPU backends. These hit the PR objective directly and look well-scoped (w4, w4a16). LGTM from the manifest perspective.
68-68: LLM sampler in FP8: ensure determinism/tolerance fit sanity. Sampler randomness can cause flakes in short sanity runs. Pin seeds and/or use tolerant asserts in the target test if not already done.
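The seed-pinning idea in miniature; this toy sampler stands in for the real one (the function name, shapes, and values are illustrative, not the TensorRT-LLM API):

```python
import random


def sample_tokens(vocab_size, num_tokens, seed):
    """Draw tokens with a local RNG so results repeat for a fixed seed."""
    rng = random.Random(seed)  # local RNG: no global random-state leakage
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]


# Two runs with the same seed must agree; this is the property a short
# sanity lane should assert instead of comparing against an unseeded run.
first = sample_tokens(32000, 8, seed=1234)
second = sample_tokens(32000, 8, seed=1234)
assert first == second
```

Using a local `random.Random` instance (rather than seeding the global module) keeps the determinism contained to the test.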
120-120: Add Mistral-Small-24B auto-dtype to sanity; looks good. Reasonable lightweight coverage. No further action.
124-124: Phi4MiniInstruct auto-dtype addition: LGTM. Adds breadth without much runtime impact.
137-142: W4A16/W4A8-MXFP4 TRTLLM combos: verify backend support and artifacts. Please ensure the quantized artifacts are available in the CI cache and the TRT-LLM backend knobs are valid for these exact tuples.
154-154: UCX variant: confirm the CI UCX stack and network fabric are present. UCX tests can be environment-sensitive. Verify UCX packages, env flags (e.g., `UCX_TLS`), and runner permissions are set on sanity lanes.
159-159: TRTLLM sampler sanity: ensure seed control to avoid flakiness. If assertions compare token sequences, enforce deterministic seeds or compare distributions rather than exact strings.
Signed-off-by: Xin He (SW-GPU) <[email protected]>
Force-pushed from e1dbb54 to 20a9525 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/integration/test_lists/qa/llm_function_sanity.txt (4)
1-201: All entries likely unresolved: paths appear to miss the tests/integration/ prefix. Based on repo conventions, test files typically live under tests/integration/**. Entries like `accuracy/...`, `disaggregated/...`, and `test_e2e.py` will not resolve unless the path prefix is included. Please verify and auto-fix.

Run this verification + auto-fix helper. It first checks resolution, then prints a preview rewrite that prepends `tests/integration/` to any entry not already under `tests/`:

```bash
#!/bin/bash
set -euo pipefail
list="tests/integration/test_lists/qa/llm_function_sanity.txt"

echo "== Checking resolution =="
missing=0
while IFS= read -r line || [[ -n "$line" ]]; do
  [[ -z "$line" || "$line" =~ ^# ]] && continue
  path="${line%%::*}"
  rest="${line#*::}"; class="${rest%%::*}"; test_with_params="${rest#*::}"; test="${test_with_params%%[*}"
  if [[ ! -f "$path" ]]; then
    echo "MISSING FILE: $path <-- $line"
    missing=1
  else
    # Class and test definitions
    if ! rg -nP "^\s*class\s+${class}\b" "$path" >/dev/null; then
      echo "MISSING CLASS: $class in $path <-- $line"; missing=1
    fi
    if ! rg -nP "^\s*def\s+${test}\b" "$path" >/dev/null; then
      echo "MISSING TEST: $test in $path <-- $line"; missing=1
    fi
  fi
done < "$list"
[[ $missing -eq 0 ]] && echo "All entries resolve."

echo -e "\n== Preview rewrite (no files changed) =="
awk '{
  if ($0 ~ /^[[:space:]]*#/ || $0 ~ /^[[:space:]]*$/) {print; next}
  split($0, a, "::"); p=a[1]
  if (p !~ /^tests\//) {print "tests/integration/" $0} else {print $0}
}' "$list" | head -n 60

# To apply the rewrite (make a backup first), uncomment:
# cp "$list" "$list.bak"
# awk '{
#   if ($0 ~ /^[[:space:]]*#/ || $0 ~ /^[[:space:]]*$/) {print; next}
#   split($0, a, "::"); p=a[1]
#   if (p !~ /^tests\//) {print "tests/integration/" $0} else {print $0}
# }' "$list" > "$list.tmp" && mv "$list.tmp" "$list"
```
88-91: Duplicate Llama4MaverickInstruct::test_chunked_prefill entries (FLASHINFER, TRTLLM). These lines appear twice and will double-schedule in sanity.

Apply this diff to drop the second occurrences:

```diff
--- a/tests/integration/test_lists/qa/llm_function_sanity.txt
+++ b/tests/integration/test_lists/qa/llm_function_sanity.txt
@@ -88,7 +88,5 @@
 accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=FLASHINFER]
 accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=TRTLLM]
-accuracy/test_llm_api_pytorch.py::TestLlama4MaverickInstruct::test_chunked_prefill[attn_backend=TRTLLM]
```
114-117: Duplicate Ministral8BInstruct auto_dtype and fp8 entries. Two duplicates were introduced; remove the extras to avoid redundant runs.

```diff
--- a/tests/integration/test_lists/qa/llm_function_sanity.txt
+++ b/tests/integration/test_lists/qa/llm_function_sanity.txt
@@ -114,7 +114,5 @@
 accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_auto_dtype
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_auto_dtype
 accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8
-accuracy/test_llm_api_pytorch.py::TestMinistral8BInstruct::test_fp8
```
146-147: Duplicate disaggregated_cache_aware_balance entry. This test is listed twice.

```diff
--- a/tests/integration/test_lists/qa/llm_function_sanity.txt
+++ b/tests/integration/test_lists/qa/llm_function_sanity.txt
@@ -146,7 +146,6 @@
 disaggregated/test_disaggregated.py::test_disaggregated_cache_aware_balance[TinyLlama-1.1B-Chat-v1.0]
-disaggregated/test_disaggregated.py::test_disaggregated_cache_aware_balance[TinyLlama-1.1B-Chat-v1.0]
```
🧹 Nitpick comments (6)
tests/integration/test_lists/qa/llm_function_sanity.txt (6)
5-5: NIXL/UCX backends: ensure environment guards exist or gate by stage. These entries can fail on CI hosts lacking the NIXL/UCX runtime. Ensure the underlying tests have `pytest.mark.skipif` guards or are only included in a stage that provisions those components.

Also applies to: 30-30, 154-154
95-101: Large FP8/FP4 matrices added for Llama4 families; prefer a representative set in sanity. The exhaustive tp/ep/cuda_graph/tc combinations will inflate runtime. Keep one per family per backend in sanity; move the matrix to accuracy/throughput suites.

Also applies to: 108-113
181-182: New multimodal quickstarts: confirm asset availability and caching. E2E multimodal tests often fetch media assets and models. Please confirm they use cached sample data and models to avoid a network dependency in sanity.
18-26: Nice addition: eagle3/ngram/tp_pp_symmetric coverage on Llama3.1 8B. Good targeted coverage; please keep only one symmetric tp/pp combo per benchmark in sanity to stay lean.

Also applies to: 65-73
1-201: Final hygiene: detect any remaining duplicates automatically in CI. Add a quick CI check that fails on duplicates in sanity manifests to prevent regressions.

Drop this script under `tools/ci/check_no_dups.sh` and wire it into your sanity job:

```bash
#!/bin/bash
set -euo pipefail
rc=0
for f in tests/integration/test_lists/qa/*.txt; do
  dups=$(awk '{cnt[$0]++} END {for (k in cnt) if (cnt[k]>1 && k !~ /^[[:space:]]*#/ && length(k)>0) print cnt[k] "x " k}' "$f")
  if [[ -n "$dups" ]]; then
    echo "Duplicate entries in $f:"
    echo "$dups"
    rc=1
  fi
done
exit $rc
```
40-46: Trim down sanity-stage tests for faster CI cycles. Running the provided helper script uncovered 106 "heavy" entries in tests/integration/test_lists/qa/llm_function_sanity.txt matching the multi-GPU/backend patterns (tp4/tp8, ep4/ep8, 4gpus, eagle3, nvfp4, fp8, block_scale, ucx, nixl). To keep the short sanity lane fast and representative:
- Pick a single, minimal variant per model/backend (e.g. default dtype on 1 GPU) and move all other heavy variants to an extended-sanity list.
- Create tests/integration/test_lists/qa/llm_function_extended_sanity.txt and relocate tests such as:
- TestDeepSeekV3Lite’s fp8_block_scales_4gpus_static_eplb, nvfp4_4gpus_online_eplb[…]
- All tp4/tp8 and ep4/ep8 variants across LLama3, LLama4, Qwen, Mixtral, etc.
- Eagle3, NIXL, UCX backends in accuracy/test_disaggregated_serving and elsewhere.
- You can re-use this command to list candidates anytime:
```bash
awk 'BEGIN{IGNORECASE=1} /tp[48]|ep[48]|4gpus|eagle3|nvfp4|fp8|block_scale|ucx|nixl/ { printf("%5d %s\n", NR, $0) }' tests/integration/test_lists/qa/llm_function_sanity.txt
```

Also applies to other sections of the file (e.g. lines 78-87, 105-113, 120, 128, 137-142, 154, 159, 181-182).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/integration/test_lists/qa/llm_function_sanity.txt (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_sanity.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
48-61
: Audit Complete: No Duplicates & Parameter Matrix Is Correct
- All 13 TestGPTOSS entries on lines 48–60 are unique; no exact duplicates detected.
- Coverage aligns with the intended matrix:
  - W4 1-GPU: CUTLASS, TRITON, TRTLLM (True-True)
  - W4 4-GPU: dp4/ep4/tp4 × CUTLASS, TRITON, TRTLLM
  - W4A16: dp4
  - Sanity slice remains minimal and balanced across backends.

No further changes needed.
137-142
: Param/value mismatch: test_w4a8_mxfp4[...] shows mxfp8 param — please verify

Entries like `test_w4a8_mxfp4[mxfp8-...]` mix the function name "mxfp4" with the param value `mxfp8`. Please confirm whether `mxfp8` is a valid/intentional dtype id for this test; if not, either fix the param value or rename the test (e.g. `test_w4a8_mx[...]`) to avoid confusion.

Files/locations to check:
- tests/integration/test_lists/qa/llm_function_sanity.txt — lines ~137-142 (listed test entries).
- accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_w4a8_mxfp4 — verify pytest parametrization/ids.
- tests/unittest/_torch/thop/test_w4a8_mxfp4_mxfp8_gemm.py — contains related test names (ensure naming/ids consistent).

Quick check command (example): rg -n "mxfp8|mxfp4|test_w4a8_mxfp4" -S tests
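If grepping by hand gets tedious, the name/param check can be scripted. A minimal sketch, assuming the manifest's `path::Class::test[params]` entry format; note the hits may well be intentional (e.g. mxfp8 activations alongside mxfp4 weights), so this only surfaces entries for review rather than declaring them wrong.

```python
import re

# Match entries whose function name says "mxfp4" while a bracketed
# pytest param id mentions "mxfp8".
MISMATCH = re.compile(r"test_w4a8_mxfp4\[[^\]]*mxfp8[^\]]*\]")

def find_suspects(lines):
    """Return stripped manifest entries that pair mxfp4 with an mxfp8 param."""
    return [line.strip() for line in lines if MISMATCH.search(line)]
```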
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
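For quick reference, a few illustrative invocations combining the subcommands and flags documented above; the stage, GPU, and comment values are placeholders, not recommendations:

```
/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1" --detailed-log
/bot run --gpu-type "H100_PCIe" --test-backend "pytorch"
/bot reuse-pipeline
/bot skip --comment "Docs-only change"
```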