Commit a18b381

[3.11] gh-137836: Support more RAWTEXT and PLAINTEXT elements in HTMLParser (GH-137837) (GH-140842) (GH-140852)
(cherry picked from commit a17c57e)
(cherry picked from commit 0329bd1)

Co-authored-by: Łukasz Langa <[email protected]>
1 parent 20fe182 commit a18b381

4 files changed: +163 -114 lines changed

Doc/library/html.parser.rst

Lines changed: 20 additions & 13 deletions
@@ -15,14 +15,18 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

-.. class:: HTMLParser(*, convert_charrefs=True)
+.. class:: HTMLParser(*, convert_charrefs=True, scripting=False)

    Create a parser instance able to parse invalid markup.

-   If *convert_charrefs* is ``True`` (the default), all character
-   references (except the ones in ``script``/``style`` elements) are
+   If *convert_charrefs* is true (the default), all character
+   references (except the ones in elements like ``script`` and ``style``) are
    automatically converted to the corresponding Unicode characters.

+   If *scripting* is false (the default), the content of the ``noscript``
+   element is parsed normally; if it's true, it's returned as is without
+   being parsed.
+
    An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
    when start tags, end tags, text, comments, and other markup elements are
    encountered. The user should subclass :class:`.HTMLParser` and override its
@@ -37,6 +41,9 @@ parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
    .. versionchanged:: 3.5
       The default value for argument *convert_charrefs* is now ``True``.

+   .. versionchanged:: 3.11.15
+      Added the *scripting* parameter.
+

 Example HTML Parser Application
 -------------------------------
@@ -159,24 +166,24 @@ implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 .. method:: HTMLParser.handle_data(data)

    This method is called to process arbitrary data (e.g. text nodes and the
-   content of ``<script>...</script>`` and ``<style>...</style>``).
+   content of elements like ``script`` and ``style``).


 .. method:: HTMLParser.handle_entityref(name)

    This method is called to process a named character reference of the form
    ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
-   ``True``.
+   (e.g. ``'gt'``).
+   This method is only called if *convert_charrefs* is false.


 .. method:: HTMLParser.handle_charref(name)

    This method is called to process decimal and hexadecimal numeric character
    references of the form :samp:`&#{NNN};` and :samp:`&#x{NNN};`. For example, the decimal
    equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
-   in this case the method will receive ``'62'`` or ``'x3E'``. This method
-   is never called if *convert_charrefs* is ``True``.
+   in this case the method will receive ``'62'`` or ``'x3E'``.
+   This method is only called if *convert_charrefs* is false.


 .. method:: HTMLParser.handle_comment(data)
@@ -284,8 +291,8 @@ Parsing an element with a few attributes and a title::
    Data     : Python
    End tag  : h1

-The content of ``script`` and ``style`` elements is returned as is, without
-further parsing::
+The content of elements like ``script`` and ``style`` is returned as is,
+without further parsing::

    >>> parser.feed('<style type="text/css">#python { color: green }</style>')
    Start tag: style
@@ -294,10 +301,10 @@ further parsing::
    End tag  : style

    >>> parser.feed('<script type="text/javascript">'
-   ...             'alert("<strong>hello!</strong>");</script>')
+   ...             'alert("<strong>hello! &#9786;</strong>");</script>')
    Start tag: script
         attr: ('type', 'text/javascript')
-   Data     : alert("<strong>hello!</strong>");
+   Data     : alert("<strong>hello! &#9786;</strong>");
    End tag  : script

 Parsing comments::
@@ -317,7 +324,7 @@ correct char (note: these 3 references are all equivalent to ``'>'``)::

 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
 :meth:`~HTMLParser.handle_data` might be called more than once
-(unless *convert_charrefs* is set to ``True``)::
+if *convert_charrefs* is false::

    >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    ...     parser.feed(chunk)
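
To illustrate the documented *scripting* parameter, here is a minimal sketch that is not part of the commit. The `EventCollector` class, the `parse` helper, and the `HAS_SCRIPTING` check are hypothetical names introduced for this example. The check exists because the keyword is only accepted by Python builds that include this change.

```python
import inspect
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records handler calls as (kind, value) tuples."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse(html, **kwargs):
    """Feed html to a fresh EventCollector and return the recorded events."""
    parser = EventCollector(**kwargs)
    parser.feed(html)
    parser.close()
    return parser.events


# Assumption: the keyword only exists on builds that include this change.
HAS_SCRIPTING = "scripting" in inspect.signature(HTMLParser.__init__).parameters

DOC = "<noscript><p>Enable JS</p></noscript>"

# Default (scripting=False): the noscript content is parsed as normal markup,
# so the nested <p> element produces its own start and end tag events.
print(parse(DOC))

if HAS_SCRIPTING:
    # With scripting=True the content should come back as raw text instead.
    print(parse(DOC, scripting=True))
else:
    print("this Python build does not accept the scripting keyword")
```

On builds without the change, only the default behaviour is exercised, so the sketch runs either way.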

Lib/html/parser.py

Lines changed: 18 additions & 6 deletions
@@ -109,16 +109,24 @@ class HTMLParser(_markupbase.ParserBase):
     argument.
     """

-    CDATA_CONTENT_ELEMENTS = ("script", "style")
+    # See the HTML5 specs section "13.4 Parsing HTML fragments".
+    # https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
+    # CDATA_CONTENT_ELEMENTS are parsed in RAWTEXT mode
+    CDATA_CONTENT_ELEMENTS = ("script", "style", "xmp", "iframe", "noembed", "noframes")
     RCDATA_CONTENT_ELEMENTS = ("textarea", "title")

-    def __init__(self, *, convert_charrefs=True):
+    def __init__(self, *, convert_charrefs=True, scripting=False):
         """Initialize and reset this instance.

-        If convert_charrefs is True (the default), all character references
+        If convert_charrefs is true (the default), all character references
         are automatically converted to the corresponding Unicode characters.
+
+        If *scripting* is false (the default), the content of the
+        ``noscript`` element is parsed normally; if it's true,
+        it's returned as is without being parsed.
         """
         self.convert_charrefs = convert_charrefs
+        self.scripting = scripting
         self.reset()

     def reset(self):
@@ -153,7 +161,9 @@ def get_starttag_text(self):
     def set_cdata_mode(self, elem, *, escapable=False):
         self.cdata_elem = elem.lower()
         self._escapable = escapable
-        if escapable and not self.convert_charrefs:
+        if self.cdata_elem == 'plaintext':
+            self.interesting = re.compile(r'\Z')
+        elif escapable and not self.convert_charrefs:
             self.interesting = re.compile(r'&|</%s(?=[\t\n\r\f />])' % self.cdata_elem,
                                           re.IGNORECASE|re.ASCII)
         else:
@@ -434,8 +444,10 @@ def parse_starttag(self, i):
             self.handle_startendtag(tag, attrs)
         else:
             self.handle_starttag(tag, attrs)
-            if tag in self.CDATA_CONTENT_ELEMENTS:
-                self.set_cdata_mode(tag)
+            if (tag in self.CDATA_CONTENT_ELEMENTS or
+                (self.scripting and tag == "noscript") or
+                tag == "plaintext"):
+                self.set_cdata_mode(tag, escapable=False)
             elif tag in self.RCDATA_CONTENT_ELEMENTS:
                 self.set_cdata_mode(tag, escapable=True)
         return endpos
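
The pattern selection in `set_cdata_mode` can be tried in isolation. The following is a rough standalone sketch, not the parser's actual code path. `interesting_pattern` is a hypothetical function written for this example. The final `else` pattern is an assumption, because the hunk above cuts off before that branch's body.

```python
import re


def interesting_pattern(elem, *, escapable=False, convert_charrefs=True):
    """Hypothetical standalone mirror of the branch logic in set_cdata_mode."""
    elem = elem.lower()
    if elem == "plaintext":
        # \Z only matches at the very end of the input, so no end tag
        # can ever terminate PLAINTEXT content.
        return re.compile(r'\Z')
    if escapable and not convert_charrefs:
        # RCDATA with manual charref handling: stop at '&' or at the end tag.
        return re.compile(r'&|</%s(?=[\t\n\r\f />])' % elem,
                          re.IGNORECASE | re.ASCII)
    # Assumption: this branch is not visible in the diff; here we simply
    # stop at the matching end tag.
    return re.compile(r'</%s(?=[\t\n\r\f />])' % elem,
                      re.IGNORECASE | re.ASCII)


raw = "a </xmp> b"
plain = "x </plaintext> y"
rcdata = "a &amp; b</title>"

# RAWTEXT: the first interesting position is the element's own end tag.
print(interesting_pattern("xmp").search(raw).start())

# PLAINTEXT: the only match is at the end of the input.
print(interesting_pattern("plaintext").search(plain).start() == len(plain))

# RCDATA with convert_charrefs=False: the '&' is found before the end tag.
print(interesting_pattern("title", escapable=True,
                          convert_charrefs=False).search(rcdata).start())
```

Handling `plaintext` as a separate first branch means the PLAINTEXT state consumes everything up to the end of the input, regardless of any closing tag in the text.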
