Since we now support better parallelization of the attention via "streams" (see #14363), I was planning to add an alternative approach for computing multi-sequence embeddings. However, I am starting to doubt that it would have any benefit compared to the existing method, so I am opening this PR/discussion for feedback.
Existing approach
Currently we put all tokens from all sequences in a single `ubatch` and process this with masked cross-sequence attention in a single stream. For example, a ubatch of 4 sequences with different lengths could look like this:

```
000000000011122222222222222233333
```
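For illustration, here is a minimal sketch of how such a packed batch can be built with the public API (the helper name and the tokenized `seqs` input are illustrative, error handling is omitted):

```cpp
#include "llama.h"

#include <vector>

// pack all tokens of all sequences into a single batch and encode it
// with masked cross-sequence attention in a single stream
static int32_t encode_packed(llama_context * ctx, const std::vector<std::vector<llama_token>> & seqs) {
    int32_t n_total = 0;
    for (const auto & s : seqs) {
        n_total += (int32_t) s.size();
    }

    llama_batch batch = llama_batch_init(n_total, 0, (int32_t) seqs.size());

    for (size_t i = 0; i < seqs.size(); ++i) {
        for (size_t j = 0; j < seqs[i].size(); ++j) {
            const int32_t k = batch.n_tokens++;

            batch.token   [k]    = seqs[i][j];
            batch.pos     [k]    = (llama_pos) j; // positions restart for each sequence
            batch.n_seq_id[k]    = 1;
            batch.seq_id  [k][0] = (llama_seq_id) i;
            batch.logits  [k]    = true;          // request output for all tokens
        }
    }

    const int32_t ret = llama_encode(ctx, batch);

    llama_batch_free(batch);

    return ret;
}
```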
New approach

The idea that I had, which I believe other implementations also use, is to pad the sequences to an equal length:
```
# x is a padding token - i.e. it does not attend to anything and is not attended by any other token
0000000000xxxxx
111xxxxxxxxxxxx
222222222222222
33333xxxxxxxxxx
```
We can process this batch with 4 streams in the attention.
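To sketch what constructing such a padded batch could look like on the user side (this is hypothetical - how the pad tokens would be masked out of the attention is exactly the open API question, so here they are only appended and excluded from the output):

```cpp
#include "llama.h"

#include <algorithm>
#include <vector>

// pad every sequence to the length of the longest one so that each
// sequence maps to its own attention stream; `pad_tok` is a hypothetical
// padding token id
static llama_batch make_padded_batch(const std::vector<std::vector<llama_token>> & seqs, llama_token pad_tok) {
    size_t max_len = 0;
    for (const auto & s : seqs) {
        max_len = std::max(max_len, s.size());
    }

    llama_batch batch = llama_batch_init((int32_t) (seqs.size()*max_len), 0, (int32_t) seqs.size());

    for (size_t i = 0; i < seqs.size(); ++i) {
        for (size_t j = 0; j < max_len; ++j) {
            const int32_t k = batch.n_tokens++;
            const bool is_pad = j >= seqs[i].size();

            batch.token   [k]    = is_pad ? pad_tok : seqs[i][j];
            batch.pos     [k]    = (llama_pos) j;
            batch.n_seq_id[k]    = 1;
            batch.seq_id  [k][0] = (llama_seq_id) i;
            batch.logits  [k]    = !is_pad; // no output for padding tokens
        }
    }

    return batch;
}
```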
Observations
- Constructing the `llama_encode()` input batch would become more complicated for the user, because they have to take the padding into account in order to not exceed `n_ubatch` when it is applied (see the sketch at the end of this post)
- The reason we can't reuse the `split_equal()` approach is that non-causal encoding requires all tokens of a sequence to be processed in a single ubatch

If you have any thoughts about this, let me know. For now, I will probably postpone this until I'm more convinced it would be useful.
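For reference, a small sketch of the `n_ubatch` accounting mentioned above - with padding, the batch size becomes the number of sequences times the longest length rather than the sum of the lengths. For the example above that is 4*15 = 60 tokens instead of 10 + 3 + 15 + 5 = 33 (the helper name is illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// check whether the padded layout of the given sequences fits in a ubatch
static bool fits_in_ubatch(const std::vector<int32_t> & seq_lens, int32_t n_ubatch) {
    int32_t max_len = 0;
    for (const int32_t len : seq_lens) {
        max_len = std::max(max_len, len);
    }

    // padded size: n_seqs * max_len, not the sum of the lengths
    return (int32_t) seq_lens.size()*max_len <= n_ubatch;
}
```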