Why handle "\n" specially to replace the "<|endoftext|>" in BPETokenizerSimple #813

myme5261314 · 2025-09-09T07:28:08Z

In bpe-from-scratch.ipynb, the following code snippet is from the 23rd cell.

    def load_vocab_and_merges_from_openai(self, vocab_path, bpe_merges_path):
        """
        Load pre-trained vocabulary and BPE merges from OpenAI's GPT-2 files.

        Args:
            vocab_path (str): Path to the vocab file (GPT-2 calls it 'encoder.json').
            bpe_merges_path (str): Path to the bpe_merges file  (GPT-2 calls it 'vocab.bpe').
        """
        # Load vocabulary
        with open(vocab_path, "r", encoding="utf-8") as file:
            loaded_vocab = json.load(file)
            # Convert loaded vocabulary to correct format
            self.vocab = {int(v): k for k, v in loaded_vocab.items()}
            self.inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

        # Handle newline character without adding a new token
        if "\n" not in self.inverse_vocab:
            # Use an existing token ID as a placeholder for '\n'
            # Preferentially use "<|endoftext|>" if available
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]
            else:
                # If no fallback token is available, raise an error
                raise KeyError("No suitable token found in vocabulary to map '\\n'.")

            self.inverse_vocab["\n"] = newline_token_id
            self.vocab[newline_token_id] = "\n"

The last line overwrites the vocab entry for "<|endoftext|>" with "\n". I'm not sure why the code needs this substitution.
I found no special handling of "\n" in bpe_openai_gpt2.py:

def get_encoder(model_name, models_dir):
    with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(encoder=encoder, bpe_merges=bpe_merges)
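The overwrite described for the last line of load_vocab_and_merges_from_openai can be reproduced with a toy vocabulary. This is a minimal sketch: the three-entry `loaded_vocab` dict is a made-up stand-in for encoder.json (only the IDs 198 and 50256 are taken from this thread), and the fallback logic is copied from the snippet above.

```python
# Toy stand-in for encoder.json (hypothetical, only three entries).
loaded_vocab = {"!": 0, "Ċ": 198, "<|endoftext|>": 50256}

vocab = {int(v): k for k, v in loaded_vocab.items()}
inverse_vocab = {k: int(v) for k, v in loaded_vocab.items()}

# Same fallback logic as in the notebook snippet.
if "\n" not in inverse_vocab:
    fallback_token = next(
        (token for token in ["<|endoftext|>", "Ġ", ""] if token in inverse_vocab),
        None,
    )
    if fallback_token is None:
        raise KeyError("No suitable token found in vocabulary to map '\\n'.")
    newline_token_id = inverse_vocab[fallback_token]
    inverse_vocab["\n"] = newline_token_id
    vocab[newline_token_id] = "\n"

print(repr(vocab[50256]))              # '\n'  (no longer '<|endoftext|>')
print(inverse_vocab["\n"])             # 50256
print(inverse_vocab["<|endoftext|>"])  # 50256
```

After this runs, both "\n" and "<|endoftext|>" encode to 50256, but decoding 50256 yields a newline, so the end-of-text marker can no longer be recovered from the ID.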

I've confirmed with the tiktoken interactive app that the token ID of "\n" is 198, and that "<|endoftext|>" keeps its original token ID 50256 from encoder.json.

However, the entry for token ID 198 in encoder.json is not the "\n" character but "\u010a": 198, which renders as:

>>> "\u010a"
'Ċ'

Does anyone know the details?

myme5261314 · 2025-09-09T08:07:40Z

myme5261314
Sep 9, 2025
Author

This may be related to the following code in BPETokenizerSimple.encode, but I'm not sure.

            # Check if any disallowed special tokens are in the remainder
            disallowed = [
                tok for tok in self.inverse_vocab
                if tok.startswith("<|") and tok.endswith("|>") and tok in text and tok not in allowed_special
            ]
            if disallowed:
                raise ValueError(f"Disallowed special tokens encountered in text: {disallowed}")
    
        # If no special tokens, or remaining text after special token split:
        tokens = []
        lines = text.split("\n")
        for i, line in enumerate(lines):
            if i > 0:
                tokens.append("\n")
            words = line.split()
            for j, word in enumerate(words):
                if j == 0 and i > 0:
                    tokens.append("Ġ" + word)
                elif j == 0:
                    tokens.append(word)
                else:
                    tokens.append("Ġ" + word)
    
        for token in tokens:
            if token in self.inverse_vocab:
                token_ids.append(self.inverse_vocab[token])
            else:
                token_ids.extend(self.tokenize_with_bpe(token))
    
        return token_ids
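To see why this loop needs a "\n" key in inverse_vocab, the splitting step can be pulled out on its own. This is a minimal sketch: `pretokenize` is a hypothetical helper name, and its body mirrors only the line and word splitting from the snippet above, not the special-token handling or the BPE fallback.

```python
def pretokenize(text):
    # Mirrors the line/word splitting loop from the encode snippet.
    tokens = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if i > 0:
            # A literal newline string is emitted between lines.
            tokens.append("\n")
        words = line.split()
        for j, word in enumerate(words):
            if j == 0 and i > 0:
                tokens.append("Ġ" + word)
            elif j == 0:
                tokens.append(word)
            else:
                tokens.append("Ġ" + word)
    return tokens


tokens = pretokenize("Hello world\nfoo bar")
print(tokens)  # ['Hello', 'Ġworld', '\n', 'Ġfoo', 'Ġbar']
```

The literal "\n" token is then looked up directly in inverse_vocab, which GPT-2's encoder.json does not contain (it stores 'Ċ' instead). Note also that "foo" gets a "Ġ" prefix even though no space precedes it in the input.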

4 replies

d-kleine Oct 8, 2025

Pinging @rasbt in case this question slipped through.

The assignment "\u010a": 198 for 'Ċ' is correct, since this is the token GPT-2 uses for a line break (\n in the original text). But I am not sure why the last line replaces <|endoftext|> with \n in the vocabulary. To me, it looks like the BPE tokenizer might not be handling line breaks fully correctly yet.
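The reason a newline shows up as 'Ċ' is GPT-2's byte-to-unicode table, which gives every byte a printable stand-in character before BPE is applied. The following is a sketch that re-implements the logic of the `bytes_to_unicode` function from OpenAI's GPT-2 encoder code; the variable names are mine, and it is meant as an illustration rather than a copy of the original.

```python
def bytes_to_unicode():
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control chars, space, ...) are shifted
            # to code points starting at 256, in increasing byte order.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
print(byte_to_char[ord("\n")])  # Ċ  (U+010A, i.e. 256 + 10)
print(byte_to_char[ord(" ")])   # Ġ  (U+0120, i.e. 256 + 32)
```

Bytes 0 through 32 are all non-printable here, so byte 10 (the newline) lands on 256 + 10 = U+010A and the space lands on 256 + 32 = U+0120, which is where 'Ċ' and 'Ġ' in encoder.json come from.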

rasbt Oct 9, 2025
Maintainer

@d-kleine @myme5261314

        if "\n" not in self.inverse_vocab:
            # Use an existing token ID as a placeholder for '\n'
            # Preferentially use "<|endoftext|>" if available
            fallback_token = next((token for token in ["<|endoftext|>", "Ġ", ""] if token in self.inverse_vocab), None)
            if fallback_token is not None:
                newline_token_id = self.inverse_vocab[fallback_token]

That was bad. It's updated now.

Also, there are now explicit checks:

        # Must have GPT-2's printable newline character 'Ċ' (U+010A) at id 198
        if "Ċ" not in self.inverse_vocab or self.inverse_vocab["Ċ"] != 198:
            raise KeyError("Vocabulary missing GPT-2 newline glyph 'Ċ' at id 198.")
    
        # Must have <|endoftext|> at 50256
        if "<|endoftext|>" not in self.inverse_vocab or self.inverse_vocab["<|endoftext|>"] != 50256:
            raise KeyError("Vocabulary missing <|endoftext|> at id 50256.")
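The updated loader itself is not shown in this thread, so the following is only a hypothetical sketch of how the two checks above could be combined with a newline mapping that leaves <|endoftext|> untouched. The helper name `alias_newline` and the toy dictionaries are assumptions for illustration, not the notebook's actual code.

```python
def alias_newline(vocab, inverse_vocab):
    # Hypothetical helper: validate the two GPT-2 IDs from the checks above,
    # then let a literal newline string resolve to the ID of 'Ċ'.
    if "Ċ" not in inverse_vocab or inverse_vocab["Ċ"] != 198:
        raise KeyError("Vocabulary missing GPT-2 newline glyph 'Ċ' at id 198.")
    if "<|endoftext|>" not in inverse_vocab or inverse_vocab["<|endoftext|>"] != 50256:
        raise KeyError("Vocabulary missing <|endoftext|> at id 50256.")
    # Only the string -> ID direction gains an entry; vocab is not modified,
    # so ID 50256 still decodes to '<|endoftext|>'.
    inverse_vocab["\n"] = inverse_vocab["Ċ"]
    return vocab, inverse_vocab


# Toy stand-in for encoder.json (hypothetical, two entries only).
toy = {"Ċ": 198, "<|endoftext|>": 50256}
vocab = {v: k for k, v in toy.items()}
inverse_vocab = dict(toy)

alias_newline(vocab, inverse_vocab)
print(inverse_vocab["\n"])  # 198
print(vocab[50256])         # <|endoftext|>
```

With this approach, decoding ID 198 still returns 'Ċ', so a decoder would need to translate it back to "\n" separately.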

d-kleine Oct 9, 2025

Just did a simple check, LGTM! Thank you! 👍🏻

Answer selected by rasbt

rasbt Oct 9, 2025
Maintainer

Thanks for double-checking!
