doc: update known issues #6247
Conversation
Signed-off-by: junq <[email protected]>
Walkthrough
The release notes documentation has been updated to mention a new known issue: while full chunked attention support for LLaMA4 models now allows handling sequences longer than 8K tokens, there is a known performance regression. The underlying cause is identified and a fix is planned for a future release.
Estimated code review effort: 1 (~2 minutes)
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/source/release-notes.md (1)
Line 76: Use consistent model naming & tighten wording
Elsewhere in the notes the model is referred to as “Llama 4” or “llama 4”. Using a third variant (LLaMA4) here is jarring. While touching the line, the sentence can read more crisply.
- Full chunked attention support has been added for LLaMA4 to handle >8K sequences, with a known performance regression. The root cause is identified and will be fixed in a future release.
+ While full chunked-attention support for Llama 4 now enables sequences > 8 k tokens, it currently suffers from a known performance regression. The root cause is understood and a fix is planned for a future release.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/source/release-notes.md (1 hunks)
🧠 Learnings (1)
docs/source/release-notes.md (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
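For context, here is a minimal sketch of the pattern this learning refers to. Only is_adapter_in_cpu_cache() comes from the learning itself; the function name, argument names, and return shape below are hypothetical placeholders, not the actual worker.py implementation.

```python
# Hypothetical sketch of the cache optimization described in the learning.
# Only is_adapter_in_cpu_cache() is taken from the text above; everything
# else (build_lora_request, the dict layout) is an illustrative placeholder.

def build_lora_request(adapter_id, weights, config, is_adapter_in_cpu_cache):
    if is_adapter_in_cpu_cache(adapter_id):
        # Optimization: the worker already holds this adapter in its CPU
        # cache, so skip shipping the (potentially large) weights and config.
        return {"adapter_id": adapter_id, "weights": None, "config": None}
    # Race window: the adapter may be evicted from the CPU cache between the
    # check above and the moment the worker consumes this request, so the
    # worker can receive None for an adapter it no longer holds. Per the
    # learning, simple error handling or re-checks cannot close this window.
    return {"adapter_id": adapter_id, "weights": weights, "config": config}
```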
/bot skip --comment "doc changes"
PR_Github #12551 [ skip ] triggered by Bot
PR_Github #12551 [ skip ] completed with state
@QiJune @juney-nvidia can we amend the comment to specify which scenarios are affected by the perf issue (i.e. specifically for seq len < 8K)?
In 0.21, full chunked attention support has been added to make sure the LLaMA4 model can functionally run with > 8K sequence lengths, while there is a known performance regression (affecting only the LLaMA4 model) due to this functional enhancement. The root cause of the regression has been identified and the fix will be part of a future release. @laikhtewari Are you OK with the above statement, or do we need to add more detail?
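For illustration, a rough sketch of the scenario the statement covers, using the TensorRT-LLM LLM API; the checkpoint path, prompt, and token counts are assumptions for the example and are not taken from this PR.

```python
# Hypothetical sketch for the known issue above: feeding a Llama 4
# checkpoint an input longer than 8K tokens exercises the chunked-attention
# path. The checkpoint path and prompt below are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/path/to/llama4-checkpoint")  # placeholder checkpoint path

# Repeating a short word ~9000 times yields a prompt of roughly 9K tokens,
# which is past the 8K boundary mentioned in the release note.
long_prompt = "word " * 9000

outputs = llm.generate([long_prompt], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```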
cc @nv-yilinf to comment on the known issue.
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>