
belericant

What does this PR do?

This PR adds the feature requested in #30758. The HHCache class is taken almost directly from the original H2O paper authors' code, found here. Currently the PR only adds the changes required for the Llama model class. As of now I have taken @gante's suggestion of adding Cache.post_process() and calling it within LlamaAttention.forward.
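To make the post_process() idea concrete, here is a minimal, hypothetical sketch of an H2O-style cache: the attention layer hands its attention weights back to the cache after each forward pass, the cache accumulates per-token attention mass, and once the budget (heavy hitters + recent window) is exceeded it evicts the weakest entry outside the recent window. `SimpleHHCache` and its fields are illustrative stand-ins, not the PR's actual classes or tensors.

```python
class SimpleHHCache:
    """Toy heavy-hitter cache: keeps `num_hh` high-score entries plus the
    `num_recent` most recent ones, evicting the rest as decoding proceeds."""

    def __init__(self, num_hh, num_recent):
        self.num_hh = num_hh
        self.num_recent = num_recent
        self.keys = []    # cached entries (stand-ins for key/value tensors)
        self.scores = []  # cumulative attention mass received per entry

    def update(self, key):
        # called when a new token's key/value is appended to the cache
        self.keys.append(key)
        self.scores.append(0.0)

    def post_process(self, attn_weights):
        # called from the attention layer after each forward pass, with one
        # attention weight per cached entry for the current query token
        for i, w in enumerate(attn_weights):
            self.scores[i] += w
        budget = self.num_hh + self.num_recent
        if len(self.keys) > budget:
            # evict the lowest-scoring entry outside the recent window
            window_start = len(self.keys) - self.num_recent
            candidates = self.scores[:window_start]
            drop = candidates.index(min(candidates))
            del self.keys[drop]
            del self.scores[drop]
```

For example, with a budget of one heavy hitter plus a two-token recent window, the fourth decoding step triggers eviction of whichever older token has accumulated the least attention.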

To-Do

  1. I'm not sure the logic for RoPE re-rotation is 100% correct. I think the recent tokens are handled correctly, but not the heavy-hitter tokens after eviction. Another set of eyes on that would be appreciated.
  2. Write tests to ensure that this HHCache class matches the behavior of the paper authors' original code.
  3. Benchmarking(?)

Feedback and/or help would be appreciated. Thanks!
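On the re-rotation question: since RoPE rotates each key by an angle proportional to its position, a key cached at position `p_old` that ends up at position `p_new` after eviction should be rotated by the angle for `(p_new - p_old)` so it behaves as if it had been encoded at the new position. A minimal sketch of that identity, representing each 2-D RoPE pair as a complex number (the function names and single-frequency setup are illustrative, not the PR's implementation):

```python
import math

def rope(pair, pos, theta):
    # apply a single-frequency rotary embedding at position `pos`
    return pair * complex(math.cos(pos * theta), math.sin(pos * theta))

def rerotate(rotated_pair, old_pos, new_pos, theta):
    # undo the old rotation and apply the new one in a single step:
    # rotating by (new_pos - old_pos) is equivalent to rotating the raw
    # key at new_pos, because the rotations compose additively
    delta = new_pos - old_pos
    return rotated_pair * complex(math.cos(delta * theta), math.sin(delta * theta))
```

So after eviction compacts the cache, each surviving key would be multiplied by the rotation for its position shift; a test could check that `rerotate(rope(k, p_old, theta), p_old, p_new, theta)` equals `rope(k, p_new, theta)`.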

@amyeroberts
Collaborator

cc @gante @ArthurZucker

Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Aug 3, 2024
@ArthurZucker ArthurZucker reopened this Aug 5, 2024
@ArthurZucker
Collaborator

Re-opened as we were waiting on the to-dos. @belericant, should I close it?
