ggerganov commented on Aug 28, 2025

Since we now support better parallelization of the attention via "streams" (see #14363), I was planning to add an alternative approach for computing multi-sequence embeddings. However, I am starting to doubt it would have any benefit compared to the existing method, so I am opening this PR/discussion for feedback.

Existing approach

Currently, we put all tokens from all sequences in a single ubatch and process it with masked cross-sequence attention in a single stream. For example, a ubatch of 4 sequences with different lengths could look like this:

```
000000000011122222222222222233333
```
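
For context, here is a rough sketch of how a caller packs all sequences into one batch today using the public llama_batch API. The helper name and batch setup are illustrative, not taken from this PR, and error handling is omitted:

```cpp
// rough sketch: pack all tokens of all sequences into a single batch
// (illustrative only; error handling and pooling details omitted)
#include "llama.h"

#include <cstddef>
#include <vector>

static void encode_packed(llama_context * ctx, const std::vector<std::vector<llama_token>> & seqs) {
    int32_t n_tokens_total = 0;
    for (const auto & s : seqs) {
        n_tokens_total += (int32_t) s.size();
    }

    llama_batch batch = llama_batch_init(n_tokens_total, 0, 1);

    for (size_t seq = 0; seq < seqs.size(); ++seq) {
        for (size_t i = 0; i < seqs[seq].size(); ++i) {
            const int32_t idx = batch.n_tokens++;

            batch.token   [idx]    = seqs[seq][i];
            batch.pos     [idx]    = (llama_pos) i;        // positions restart per sequence
            batch.n_seq_id[idx]    = 1;
            batch.seq_id  [idx][0] = (llama_seq_id) seq;
            batch.logits  [idx]    = 1;                    // request output (depends on pooling type)
        }
    }

    // the cross-sequence mask keeps the sequences from attending to each other
    llama_encode(ctx, batch);

    llama_batch_free(batch);
}
```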

New approach

The idea I had, which I believe other implementations also use, is to pad the sequences to equal length:

```
# x is a padding token - i.e. it does not attend to anything and is not attended by anything
0000000000xxxxx
111xxxxxxxxxxxx
222222222222222
33333xxxxxxxxxx
```

We can process this batch with 4 streams in the attention.
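
A minimal sketch of the proposed padding step, assuming a hypothetical TOKEN_PAD placeholder (no such padding token exists in the current API; in an actual implementation the padded positions would be masked out inside the attention):

```cpp
// hypothetical sketch of the proposed padding step (not part of the current API):
// pad every sequence to the length of the longest one so that each sequence
// becomes one equal-width row, i.e. one stream in the attention
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;              // matches the typedef in llama.h
static const llama_token TOKEN_PAD = -1;  // placeholder id; a real implementation would mask these out

static std::vector<std::vector<llama_token>> pad_to_equal_length(
        const std::vector<std::vector<llama_token>> & seqs) {
    size_t max_len = 0;
    for (const auto & s : seqs) {
        max_len = std::max(max_len, s.size());
    }

    std::vector<std::vector<llama_token>> padded = seqs;
    for (auto & s : padded) {
        // padding tokens neither attend to nor are attended by real tokens,
        // so they only add compute in the non-attention ops (FFN, norms, ...)
        s.resize(max_len, TOKEN_PAD);
    }
    return padded;
}
```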

Observations

  • The new approach might be a bit more efficient, though with embeddings the sequences are usually relatively short anyway, so performance would probably be fine either way
  • The computation in the non-attention operators (FFN, norms, etc.) will increase because of the extra padding tokens
  • The logic for preparing the llama_encode() input batch would become more complicated for the user, because they would have to account for the padding in order to not exceed n_ubatch when it is applied (see the sketch after this list)
  • Note that the reason we cannot use the existing split_equal() approach is that non-causal encoding requires processing all tokens of a sequence in a single ubatch
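
To illustrate the bookkeeping mentioned in the third bullet, here is a hedged sketch of the check a caller would need to perform before building the batch: with padding, the effective ubatch size becomes the number of sequences times the longest sequence, rather than the plain sum of sequence lengths that suffices today (the function name is hypothetical):

```cpp
// hypothetical check the caller would need with the padded approach:
// the effective ubatch size is n_seqs * longest_seq, not the sum of lengths
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

static bool fits_in_ubatch_padded(const std::vector<std::vector<int32_t>> & seqs, uint32_t n_ubatch) {
    size_t max_len = 0;
    for (const auto & s : seqs) {
        max_len = std::max(max_len, s.size());
    }
    return seqs.size() * max_len <= (size_t) n_ubatch;
}
```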

If you have any thoughts about this, let me know. For now, I will probably postpone this until I'm more convinced it would be useful.
