@LucasWilkinson
Collaborator

Alternative fix to: #26498

Simplify the padded drafter batch fix by adjusting seq_lens and seq_lens_cpu
inside the drafting loop at token_index==0, rather than using complex mask
calculations.
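
As a rough illustration of the idea (not the code from this PR), the sketch below tracks per-request sequence lengths across draft steps in plain Python. It assumes the padded lengths include rejected tokens and that the adjustment subtracts them once, on the first loop iteration. The function name `draft_seq_lens` and the `num_rejected` argument are hypothetical.

```python
def draft_seq_lens(padded_seq_lens, num_rejected, num_draft_steps):
    """Return the per-request sequence lengths seen at each draft step.

    Hypothetical helper for illustration only; the names and the exact
    adjustment are assumptions, not vLLM's implementation.
    """
    seq_lens = list(padded_seq_lens)
    per_step = []
    for token_index in range(num_draft_steps):
        if token_index == 0:
            # Assumed adjustment: drop the padded (rejected) tokens once,
            # so later steps do not count them.
            seq_lens = [s - r for s, r in zip(seq_lens, num_rejected)]
        # Each draft step appends one new token per request.
        seq_lens = [s + 1 for s in seq_lens]
        per_step.append(list(seq_lens))
    return per_step


# Two requests padded to lengths 10 and 8; the first has 2 rejected tokens.
steps = draft_seq_lens([10, 8], [2, 0], num_draft_steps=3)
```

The point of the sketch is that a single in-loop correction keeps every later step consistent, with no separate mask needed to hide the padded positions.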

This addresses the acceptance rate issue outlined in vllm-project#26191, where acceptance length (AL)
drops by about 5% when long speculative sequences are used.

Co-authored-by: Benjamin Chislett <[email protected]>

Signed-off-by: Lucas Wilkinson <[email protected]>