Skip to content

Conversation

@kapyteinaikido
Copy link
Collaborator

https://docs.python.org/3/library/codecs.html

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1),

this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

This should not have impact as we're only reading the file, and thus the changed encoding is only used to be able to read out the BOM character (if it exists) and omit it.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

New/Edited policies (Delete if not relevant)

Description

Include a description of what makes it a violation and any relevant external links.

Fix

How does someone fix the issue in code and/or in runtime?

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my feature, policy, or fix is effective and works
  • New and existing tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@willem-delbare willem-delbare merged commit c446efd into main Jul 29, 2025
1 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants