Replies: 3 comments 2 replies
-
@JacobSzwejbka any ideas on how one might do this?
-
I think we need to adjust our runner; the rest of the capabilities already exist. How long of a context to keep is a bit of an open question and still being worked out. KV cache quantization would help here.
-
In the Android app we should always use
-
I have a short question. I built and uploaded an Android app deploying Llama 3 (https://bwsyncandshare.kit.edu/s/t3898Ge7AZ6SWBn).
However, I couldn't get the model to continue using the last conversation turns.
I assume that the KV cache is stored in module_ internally, and that here
https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/runner.cpp#L175
only the last decoded token and the position index of that token are given to the model. Is that correct?
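If I understand it correctly, each decode step then looks roughly like the sketch below. This is only an illustration with hypothetical types (LlamaModule and argmax are stand-ins, not the actual ExecuTorch runner/module API): only the newest token and its position cross the module boundary, and the attention history lives in the KV cache inside the module.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the exported Llama module (NOT the ExecuTorch API).
// The module owns the KV cache, so forward() only needs the newest token and
// the position it should be written to.
struct LlamaModule {
  int64_t vocab_size = 128256;  // Llama 3 vocabulary size
  std::vector<float> forward(int64_t /*token*/, int64_t /*start_pos*/) {
    // Placeholder for running the exported model; returns next-token logits.
    return std::vector<float>(static_cast<std::size_t>(vocab_size), 0.0f);
  }
};

// Greedy pick of the most likely next token (sampling omitted for brevity).
int64_t argmax(const std::vector<float>& logits) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return static_cast<int64_t>(best);
}

// One decode loop: per step, only (token, position) are passed in; all
// earlier tokens are already represented in the module's KV cache.
void decode(LlamaModule& module, int64_t first_token, int64_t start_pos, int64_t max_new_tokens) {
  int64_t token = first_token;
  for (int64_t i = 0; i < max_new_tokens; ++i) {
    std::vector<float> logits = module.forward(token, start_pos + i);
    token = argmax(logits);
    // ... detokenize/emit the token, stop on EOS ...
  }
}
```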
To use the last conversation turns within the next prompt, I tried to start here
https://github.com/pytorch/executorch/blob/main/examples/models/llama2/runner/runner.cpp#L277
not with 0 as the start position index but with the number of tokens that were decoded during the previous conversation turns. However, that didn't work: the model didn't remember the earlier conversation (I tried e.g. "My name is Christian" -> answer -> "What is my name?"). Is my approach wrong?
For performance reasons I don't want to feed the whole conversation history to the model again on every turn.
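To make the intended multi-turn flow concrete, here is a rough sketch of what I am aiming for, reusing the hypothetical LlamaModule and argmax from the sketch above (again, not the actual ExecuTorch runner API): one position counter stays alive for the whole chat, the new prompt is prefilled at positions continuing from the previous turn, and decoding keeps advancing the same counter.

```cpp
#include <cstdint>
#include <vector>

// Builds on the hypothetical LlamaModule/argmax defined in the sketch above.
struct ChatSession {
  LlamaModule& module;
  int64_t pos = 0;  // number of tokens already written to the KV cache

  void turn(const std::vector<int64_t>& prompt_tokens, int64_t max_new_tokens) {
    int64_t token = 0;
    // Prefill: the new turn's prompt tokens go into the cache at positions
    // continuing from the previous turn, not starting at 0.
    for (int64_t t : prompt_tokens) {
      token = argmax(module.forward(t, pos++));
    }
    // Decode: keep advancing the same position counter, one token per step.
    for (int64_t i = 0; i < max_new_tokens; ++i) {
      token = argmax(module.forward(token, pos++));
      // ... emit the token, stop on EOS or when the cache's max context is reached ...
    }
  }
};
```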
Best,
Christian