Why handle "\n" specially to replace the "<|endoftext|>" in BPETokenizerSimple #813
Answered by d-kleine
myme5261314 asked this question in Q&A
Answered by d-kleine on Oct 9, 2025
Maybe related to the following code:

    # Check if any disallowed special tokens are in the remainder
    disallowed = [
        tok for tok in self.inverse_vocab
        if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
    ]
    if disallowed:
        raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")

    # If no special tokens, or remaining text after special token split:
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if i > 0:
            tokens.append("\n")
        words = line.split()
        for j, word in enumerate(words):
            if j == 0 and i > 0:
                tokens.append("Ġ" + word)
            elif j == 0:
                tokens.append(word)
            else:
                tokens.append("Ġ" + word)

    for token in tokens:
        if token in self.inverse_vocab:
            token_ids.append(self.inverse_vocab[token])
        else:
            token_ids.extend(self.tokenize_with_bpe(token))

    return token_ids
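
To make the effect of the newline handling easier to see, here is a minimal sketch that lifts only the splitting loop from the excerpt into a standalone function. The name `pretokenize` is hypothetical and not part of `BPETokenizerSimple`; the vocabulary lookup and the BPE fallback are left out on purpose.

```python
def pretokenize(text):
    """Hypothetical helper mirroring the line/word splitting loop quoted above."""
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if i > 0:
            # Every line break becomes its own "\n" token.
            tokens.append("\n")
        for j, word in enumerate(line.split()):
            if i == 0 and j == 0:
                # Only the very first word of the text has no "Ġ" prefix.
                tokens.append(word)
            else:
                # All other words, including the first word after a
                # newline, get the "Ġ" (space) marker.
                tokens.append("Ġ" + word)
    return tokens


print(pretokenize("Hello world\nfoo bar"))
# ['Hello', 'Ġworld', '\n', 'Ġfoo', 'Ġbar']

print(pretokenize("a\n\nb"))
# ['a', '\n', '\n', 'Ġb']
```

Two things stand out in this sketch: the first word after a newline is prefixed with "Ġ" even when the text has no space there, and `line.split()` collapses runs of whitespace, so consecutive spaces are not preserved.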
4 replies
Just did a simple check, LGTM! Thank you! 👍🏻