Add more information about when to use (and not use) the BOM #655

Merged: 3 commits, Jul 17, 2025
4 changes: 2 additions & 2 deletions questions/qa-byte-order-mark-data/translations.js
@@ -1,8 +1,8 @@
var trans = { }

trans.versions = ['de','en', 'fr']
trans.versions = ['en']

trans.outofdatetranslations = []
trans.outofdatetranslations = ['de', 'fr']

trans.updatedtranslations = []

9 changes: 5 additions & 4 deletions questions/qa-byte-order-mark.de.html
@@ -20,8 +20,8 @@
f.path = '../' // what you need to prepend to a URL to get to the /International directory

// AUTHORS AND TRANSLATORS should fill in these assignments:
f.thisVersion = { date:'2016-04-20', time:'11:10'} // date and time of latest edits to this document/translation
f.contributors = 'Albert Lunde, Asmus Freytag, Björn Höhrmann, Henri Sivonen, John Cowan, Leif Halvard Silli, Norbert Lindenberg' // people providing useful contributions or feedback during review or at other times
f.thisVersion = { date:'2025-07-16', time:'11:10'} // date and time of latest edits to this document/translation
f.contributors = 'Albert Lunde, Asmus Freytag, Björn Höhrmann, Fuqiao Xue, Henri Sivonen, John Cowan, Leif Halvard Silli, Norbert Lindenberg' // people providing useful contributions or feedback during review or at other times
// also make sure that the lang attribute on the html tag is correct!
f.sources = '' // describes sources of information

@@ -86,12 +86,13 @@ <h2>Antwort</h2>
<section id="bomwhat">
<h3>Was ist ein BOM?</h3>
<div class="sidenoteGroup">
<p>Am Anfang einer Webseite, die eine <a class="termref" href="/International/articles/definitions-characters/Overview#unicode">Unicode</a>-<a class="termref" href="/International/articles/definitions-characters/Overview#charsets">Zeichencodierung</a> verwendet, stehen möglicherweise einige Bytes, die das Unicode-Zeichen U+FEFF <span lang="en" xml:lang="en" translate="no">BYTE ORDER MARK</span> (abgekürzt <dfn>BOM</dfn>) darstellen.</p>
<p>A <dfn>Byte Order Mark</dfn>, sometimes abbreviated "BOM", is a special Unicode character intended to appear at the very beginning of a text file. Its original purpose was to indicate the <q><a href="https://en.wikipedia.org/wiki/Endianness">endianness</a></q> of text that used the UTF-16 or UTF-32 character encodings of Unicode. The Byte Order Mark is U+FEFF ZERO WIDTH NO-BREAK SPACE: the character name refers to a separate, deprecated use of the character.</p>
<p>Some systems use the BOM code point at the start of a file to indicate that text files are using the UTF-8 character encoding, even though UTF-8 does not need a marker to indicate endianness.</p>
<p>While often invisible and intended to aid in correctly interpreting text, the presence of the BOM can sometimes cause unexpected display issues or problems with software if not handled correctly.</p>
<div class="insideinfonote">
<p class="info">Die Bezeichnung <span lang="en" xml:lang="en" translate="no">BYTE ORDER MARK</span> ist ein Alias für die ursprüngliche Bezeichnung <span lang="en" xml:lang="en" translate="no">ZERO WIDTH NO-BREAK SPACE</span> (ZWNBSP, nullbreites geschütztes Leerzeichen). Mit der Einführung des Zeichens U+2060 <span lang="en" xml:lang="en" translate="no">WORD JOINER</span> (Wortverbinder) besteht keine Notwendigkeit mehr, U+FEFF in seiner ZWNBSP-Funktion zu verwenden. Ab diesem Zeitpunkt und weil es einen formellen Alias gibt, ist die Bezeichnung <span lang="en" xml:lang="en" translate="no">ZERO WIDTH NO-BREAK SPACE</span> nicht mehr passend. Hier wird deswegen der Alias verwendet.</p>
</div>
</div>
<p>Das BOM ist bei korrekter Verwendung unsichtbar.</p>
<p>Bevor UTF-8 Anfang 1993 eingeführt wurde, war der vorgesehene Weg, Unicode-Text zu übertragen, Zeichen in 16 Bit zu codieren. Die Zeichencodierung wurde UCS-2 genannt und später zu UTF-16 erweitert. Einheiten zu 16 Bit können auf zwei Arten in Bytes repräsentiert werden: das höherwertige Byte zuerst (<span class="qterm" lang="en" xml:lang="en" translate="no">big-endian</span>) oder das niederwertige Byte zuerst (<span class="qterm" lang="en" xml:lang="en" translate="no">little-endian</span>). Um anzugeben, welche Reihenfolge der Bytes verwendet wurde, wird das Zeichen U+FEFF (das BOM, <span lang="en" xml:lang="en" translate="no">byte-order mark</span>) an den Anfang des Datenstroms gesetzt – als Wundermittel, das sinngemäß nicht zum Text gehört, den der Datenstrom repräsentiert.</p>
<p>Die folgende Abbildung zeigt die Bytes für eine Folge von Zwei-Byte-Zeichen. Jede Hexadezimalzahl mit 2 Ziffern steht für ein Byte im Datenstrom. Sie können sehen, dass die Reihenfolge der beiden Bytes, die ein Zeichen repräsentieren, bei <span class="qterm" lang="en" xml:lang="en" translate="no">big-endian</span> gegenüber <span class="qterm" lang="en" xml:lang="en" translate="no">little-endian</span> umgedreht ist. Das BOM zeigt an, welche Reihenfolge gilt, damit die Anwendung den Inhalt unmittelbar decodieren kann.</p>
<p><img src="qa-byte-order-mark-data/bom.png" alt="Bytes, die das BOM repräsentieren." /></p>
46 changes: 37 additions & 9 deletions questions/qa-byte-order-mark.en.html
@@ -20,8 +20,8 @@
f.path = '../' // what you need to prepend to a URL to get to the /International directory

// AUTHORS AND TRANSLATORS should fill in these assignments:
f.thisVersion = { date:'2016-04-20', time:'11:10'} // date and time of latest edits to this document/translation
f.contributors = 'Albert Lunde, Asmus Freytag, Björn Höhrmann, Henri Sivonen, John Cowan, Leif Halvard Silli, Norbert Lindenberg, Gwendoline Clavé' // people providing useful contributions or feedback during review or at other times
f.thisVersion = { date:'2025-07-17', time:'11:10'} // date and time of latest edits to this document/translation
f.contributors = 'Albert Lunde, Asmus Freytag, Björn Höhrmann, Fuqiao Xue, Henri Sivonen, John Cowan, Leif Halvard Silli, Norbert Lindenberg, Gwendoline Clavé' // people providing useful contributions or feedback during review or at other times
// also make sure that the lang attribute on the html tag is correct!
f.sources = '' // describes sources of information

@@ -104,14 +104,14 @@ <h2>Answer</h2>
<h3> What is a byte-order mark?</h3>

<div class="sidenoteGroup">
<p>At the beginning of a page that uses a <a class="termref" href="/International/articles/definitions-characters/#unicode">Unicode</a> <a class="termref" href="/International/articles/definitions-characters/#charsets">character encoding</a> you may find some bytes that represent the Unicode code point U+FEFF BYTE ORDER MARK (abbreviated as <dfn>BOM</dfn>).</p>
<p>A <dfn>Byte Order Mark</dfn>, sometimes abbreviated "BOM", is a special Unicode character intended to appear at the very beginning of a text file. Its original purpose was to indicate the <q><a href="https://en.wikipedia.org/wiki/Endianness">endianness</a></q> of text that used the UTF-16 or UTF-32 character encodings of Unicode. The Byte Order Mark is U+FEFF ZERO WIDTH NO-BREAK SPACE: the character name refers to a separate, deprecated use of the character.</p>
<p>Some systems use the BOM code point at the start of a file to indicate that text files are using the UTF-8 character encoding, even though UTF-8 does not need a marker to indicate endianness.</p>
<p>While often invisible and intended to aid in correctly interpreting text, the presence of the BOM can sometimes cause unexpected display issues or problems with software if not handled correctly.</p>
<div class="insideinfonote">
<p class="info">The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNBSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.</p>
</div>
</div>

<p>The BOM, when correctly used, is invisible.</p>

<p>Before UTF-8 was introduced in early 1993, the expected way for transferring Unicode text was using 16-bit code units using an encoding called UCS-2 which was later extended to UTF-16. 16-bit code units can be expressed as bytes in two ways: the most significant byte first (<span class="qterm">big-endian</span>) or the least significant byte first (<span class="qterm">little-endian</span>). To communicate which byte order was in use, U+FEFF (the byte-order mark) was used at the start of the stream as a magic number that is not logically part of the text the stream represents.</p>

<p>The picture below shows the bytes used in a sequence of two-byte characters. Each 2-digit hexadecimal number represents a byte in the stream of text. You can see that the order of the two bytes that represent a single character is reversed for big endian vs. little endian storage. The byte-order mark indicates which order is used, so that applications can immediately decode the content.</p>
@@ -142,6 +142,34 @@ <h3> What do I need to know about the BOM?</h3>
<p>If you use a UTF-16 encoding for your page (and we strongly recommend that you don't), there are some <a href="#additionalinfo">additional considerations</a>.</p>
</section>
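
To make the "problems with software" point concrete, here is a minimal JavaScript sketch (not part of the merged change). It assumes a runtime with the standard `TextDecoder` API, such as a current browser or Node.js; the helper name `stripBom` is hypothetical. It shows that a leading BOM left in decoded text makes `JSON.parse` fail, and that removing it fixes the problem.

```javascript
// Hypothetical helper: drop a leading U+FEFF from an already-decoded string.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// A UTF-8 BOM (EF BB BF) followed by the JSON text {"a":1}.
const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x7B, 0x22, 0x61, 0x22, 0x3A, 0x31, 0x7D]);

// By default, TextDecoder removes a leading BOM while decoding.
const stripped = new TextDecoder('utf-8').decode(bytes);

// With ignoreBOM: true, the BOM is kept as U+FEFF at the start of the string.
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(bytes);

// U+FEFF is not JSON whitespace, so parsing the unstripped text throws.
let parseFailed = false;
try {
  JSON.parse(kept);
} catch (err) {
  parseFailed = err instanceof SyntaxError;
}

// After stripping the BOM, the same text parses normally.
const parsed = JSON.parse(stripBom(kept));
```

Relying on the decoder's default BOM handling is usually simpler than stripping by hand; `stripBom` is only useful when the text arrives already decoded.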

<section id="whenToUseBOM">
<h3>When to Use (and Not Use) the BOM</h3>
<p>Whether a BOM is needed, or even recommended, depends on the Unicode encoding scheme being used.</p>

<h4>UTF-8</h4>
<p>For UTF-8, the BOM is the byte sequence <code>EF BB BF</code>. Unlike UTF-16 and UTF-32, UTF-8 does not have byte order (endianness) issues, so a BOM is not needed for this purpose. Its only function in UTF-8 is to act as a "signature" to indicate that the file is UTF-8 encoded. The Unicode Standard permits the BOM in UTF-8 but does not recommend its use.</p>
<p><strong>Recommendation:</strong> Generally, it's best to avoid using a BOM with UTF-8 files unless you have a specific reason or compatibility requirement. Always prefer UTF-8 without a BOM if possible.</p>

<h4>UTF-16 (UTF-16BE &amp; UTF-16LE)</h4>
<p>For UTF-16, the BOM is crucial for indicating endianness if the specific endianness is not already defined by the character set label (e.g., if labeled just as "UTF-16").</p>
<ul>
<li><code>FE FF</code>: Indicates Big Endian (UTF-16BE).</li>
<li><code>FF FE</code>: Indicates Little Endian (UTF-16LE).</li>
<li>If a UTF-16 stream is read with the wrong endianness, the BOM character <code>U+FEFF</code> will appear as <code>U+FFFE</code>, which is a noncharacter.</li>
<li>If the character set is explicitly stated as "UTF-16BE" or "UTF-16LE", a BOM should <em>not</em> be used as the byte order is already known.</li>
<li><strong>Recommendation:</strong> Use a BOM if your UTF-16 data might be interpreted by systems with different native endianness and the specific endianness (BE or LE) is not declared by a higher-level protocol. If the specific UTF-16 encoding (LE or BE) is known and declared, omit the BOM. (However, for HTML, UTF-8 is strongly preferred over UTF-16).</li>
</ul>
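
The byte-order behaviour in the list above can be sketched in JavaScript with the standard `DataView` API. This is an illustration only, not part of the merged change; the function names `firstCodeUnit` and `utf16ByteOrder` are hypothetical.

```javascript
// Read the first 16-bit code unit of a byte stream with a chosen byte order.
function firstCodeUnit(bytes, littleEndian) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint16(0, littleEndian);
}

// U+FEFF serialized in big-endian order: FE FF.
const bigEndianBom = new Uint8Array([0xFE, 0xFF]);

// Read with the correct byte order: the BOM, U+FEFF.
const readCorrectly = firstCodeUnit(bigEndianBom, false);

// Read with the wrong byte order: U+FFFE, a noncharacter.
const readSwapped = firstCodeUnit(bigEndianBom, true);

// Illustrative check a decoder might perform on unlabeled UTF-16 data.
function utf16ByteOrder(bytes) {
  if (bytes.length < 2) return 'unknown';
  const unit = firstCodeUnit(bytes, false); // always read as big-endian
  if (unit === 0xFEFF) return 'big-endian';
  if (unit === 0xFFFE) return 'little-endian';
  return 'unknown';
}
```

Because U+FFFE is guaranteed never to be a valid character, seeing it first is an unambiguous sign that the byte order is reversed.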

<h4>UTF-32 (UTF-32BE &amp; UTF-32LE)</h4>
<p>Similar to UTF-16, the BOM in UTF-32 indicates endianness but UTF-32 is rarely used for transmission or web content.</p>
<ul>
<li><code>00 00 FE FF</code>: Indicates Big Endian (UTF-32BE).</li>
<li><code>FF FE 00 00</code>: Indicates Little Endian (UTF-32LE).</li>
<li><strong>Recommendation:</strong> Similar to UTF-16, use a BOM if endianness is not otherwise specified. (Again, UTF-8 is preferred for HTML).</li>
</ul>
</section>
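
The byte sequences listed in this section can be collected into a small detection routine. The JavaScript below is a sketch, not part of the merged change; the names `BOMS` and `detectBom` are hypothetical. The UTF-32 entries are checked first because `FF FE 00 00` (UTF-32LE) begins with `FF FE` (UTF-16LE).

```javascript
// BOM byte sequences, longest first so UTF-32LE is not mistaken for UTF-16LE.
const BOMS = [
  { encoding: 'UTF-32BE', bytes: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: 'UTF-32LE', bytes: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: 'UTF-8', bytes: [0xEF, 0xBB, 0xBF] },
  { encoding: 'UTF-16BE', bytes: [0xFE, 0xFF] },
  { encoding: 'UTF-16LE', bytes: [0xFF, 0xFE] },
];

// Return the encoding signalled by a leading BOM and its length in bytes,
// or null when the input does not start with any known BOM.
function detectBom(input) {
  for (const { encoding, bytes } of BOMS) {
    const matches =
      input.length >= bytes.length && bytes.every((b, i) => input[i] === b);
    if (matches) {
      return { encoding, bomLength: bytes.length };
    }
  }
  return null;
}

// Usage: a UTF-8 BOM followed by the letter "A".
const result = detectBom(new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]));
```

One caveat of this ordering: UTF-16LE text that starts with a BOM followed by U+0000 has the same first four bytes as a UTF-32LE BOM, so a BOM check alone cannot tell those two cases apart.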





@@ -271,18 +299,18 @@ <h3>Removing the BOM</h3>
<section id="additionalinfo">
<h2>Additional information</h2>

<p>Here are some additional notes for those who are encoding their HTML pages using UTF-16. Note that, for HTML it's recommended that you use UTF-8 and that you avoid UTF-16. So for most people this section will be academic.</p>
<p>This section provides further details primarily for those encoding HTML pages using UTF-16 or UTF-32. As a strong general recommendation, <strong>UTF-8 should be used for all HTML content</strong> over UTF-16 or UTF-32.</p>

<div class="sidenoteGroup">
<p>According to RFC 2781 and the Unicode Standard, if you declare the character encoding of your page using HTTP as either &quot;UTF-16LE&quot; or &quot;UTF-16BE&quot; then you should not use a byte-order mark at the beginning of the page. Only if the page is labelled in HTTP using IANA charset name &quot;UTF-16&quot; is a byte-order mark appropriate.</p>
<p>For <strong>UTF-16</strong>, as detailed in the <a href="#whenToUseBOM">"When to Use (and Not Use) the BOM"</a> section, a BOM is appropriate if the page is simply labeled with the IANA charset "UTF-16" to indicate endianness. However, if the character encoding is declared via HTTP as specifically "UTF-16LE" or "UTF-16BE", a BOM should not be used. This guidance aligns with RFC 2781 and the Unicode Standard.</p>
<div class="sideinfonote">
<p class="warning">Note that this is solely about the <em>labeling</em> of the content. Of course, the actual sequence of bytes is the same, whether you label content as UTF-16 and add a BOM, or whether you label it as UTF-16LE or UTF-16BE.</p>
</div>
</div>

<p>The HTML5 specification currently disallows the use of any other, text-based in-document encoding declaration for pages using the UTF-16 encoding. In effect, this means that the BOM is, itself, the declaration that you have to add.</p>
<p>The HTML5 specification currently disallows the use of any other, text-based in-document encoding declarations (like a <code class="kw" translate="no">meta</code> tag) for pages using UTF-16. In effect, if you are using the generic "UTF-16" label, the BOM itself serves as the necessary in-stream declaration of byte order.</p>

<p>The byte-order mark is also used for text labeled as UTF-32, and should not be used for text labeled as UTF-32BE or UTF-32LE. The use of UTF-32 for HTML content, however, is strongly discouraged and some implementations have removed support for it, so we haven't even mentioned it until now.</p>
<p>Similarly, for <strong>UTF-32</strong>, a BOM can be used if the content is labeled generically as "UTF-32". It should not be used if the label is specifically "UTF-32BE" or "UTF-32LE". However, the use of UTF-32 for HTML content is strongly discouraged, and some implementations have removed support for it.</p>
</section>

