Conversation

aria42 commented Dec 30, 2021

I was testing the SentencePieceModel tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Looking at the code for the decode_forward pass, it seems candidate spans (char_start, char_end) of arbitrary length are considered, even though the vocabulary has some maximum-length element. Constraining decoding to spans of at most this maximum length yields the same result, since no longer substring can be present in the vocabulary, and it has a dramatic impact when tokenizing longer pieces of text. This PR addresses the problem by adding a max_vocab_codeunit_len field to SentencePieceModel that caches the longest code-unit length of any vocabulary element; this field is used to truncate the span search during decoding.
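To make the idea concrete, here is a minimal sketch of a character-level unigram segmenter whose inner loop is truncated to the length of the longest vocabulary piece. This is toy code, not the actual WordTokenizers internals: ToyModel, max_piece_len, and segment are made-up names for illustration, whereas the PR itself caches max_vocab_codeunit_len on SentencePieceModel and applies the same bound inside decode_forward.

# Toy unigram segmenter illustrating the bound; not WordTokenizers internals.
struct ToyModel
    vocab::Dict{String,Float64}   # piece => log-probability
    max_piece_len::Int            # longest vocabulary piece, in characters
end

ToyModel(vocab) = ToyModel(vocab, maximum(length(k) for k in keys(vocab)))

function segment(m::ToyModel, text::AbstractString)
    chars = collect(text)
    n = length(chars)
    best = fill(-Inf, n + 1); best[1] = 0.0   # best[i+1] = best score of a segmentation ending at char i
    back = zeros(Int, n + 1)                  # backpointer to the span start
    for i in 1:n
        # Without the bound, j would range over 0:i-1, i.e. spans of arbitrary length.
        # No span longer than max_piece_len can be in the vocabulary, so skipping
        # them changes nothing but saves a lot of work on long text.
        for j in max(0, i - m.max_piece_len):i-1
            piece = String(chars[j+1:i])
            score = best[j+1] + get(m.vocab, piece, -Inf)
            if score > best[i+1]
                best[i+1] = score
                back[i+1] = j
            end
        end
    end
    # Backtrack to recover the chosen pieces.
    pieces, i = String[], n
    while i > 0
        j = back[i+1]
        pushfirst!(pieces, String(chars[j+1:i]))
        i = j
    end
    return pieces
end

For example, with vocab = Dict("▁he" => -2.0, "llo" => -2.5, "▁hello" => -1.0), segment(ToyModel(vocab), "▁hello") returns ["▁hello"], and the inner loop never materializes a substring longer than six characters.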

The gist below highlights the performance gap. There is existing test coverage for this code and those tests still pass, but I'm happy to add more if there's something else worth testing in the implementation.

using HTTP
using WordTokenizers

# Load the pretrained ALBERT v1 SentencePiece model
spm = load(ALBERT_V1)

# Download the Hamlet text and truncate to roughly the first 5k bytes,
# snapping to a valid character boundary with thisind
long_text = String(HTTP.get("https://dlg.usg.edu/record/dlg_zlgb_gb5027/fulltext.text").body)
max_len = 5000
long_text = long_text[begin:thisind(long_text, max_len)]

# Time tokenization of the long text
@time spm(long_text)

Before this PR: 7.098319 seconds (195.33 k allocations: 11.746 MiB, 1.41% compilation time)
With this PR: 0.016252 seconds (8.23 k allocations: 1.026 MiB)


aria42 commented Jan 7, 2022

Thanks @aviks! I'm assuming this means I can merge the PR, or do I need another approver?


aria42 commented Jan 24, 2022

@aviks I think you need to merge since I can't do it myself.
