Conversation

StanFromIreland
Member

@StanFromIreland StanFromIreland commented Sep 25, 2025

@corona10
Member

@StanFromIreland Thank you for the investigation. Let me take a look this weekend :)

@corona10 corona10 added the "needs backport to 3.13" and "needs backport to 3.14" labels (both: bugs and security fixes) on Sep 28, 2025
Member

@corona10 corona10 left a comment

LGTM!
This patch prevents a null character (U+0000) from being consumed as the second half of a character pair during encoding. It fixes the data loss bug, but it also changes existing behavior, so backporting needs discussion.

cc @methane

@corona10
Member

corona10 commented Oct 4, 2025

I will wait a week for Naoki-san and then plan to merge this PR.

@methane
Member

methane commented Oct 6, 2025

euc_jis_2004 has the same logic. Would you update it too?

>>> "\u00e6abc".encode('euc_jis_2004')
b'\xa9\xdcabc'
>>> "\u00e6\0abc".encode('euc_jis_2004')
b'\xa9\xdcabc'
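
A round-trip check along these lines might help confirm the behavior on a given build. It is only a sketch: the helper name `roundtrips` and the codec list are assumptions for illustration and not part of the PR, and the result depends on whether the interpreter already includes the fix.

```python
# Hypothetical helper (not part of the PR): report whether a string
# survives encode -> decode unchanged with the given codec.
def roundtrips(text, codec):
    return text.encode(codec).decode(codec) == text


# Codecs mentioned so far in this thread; the list is an assumption
# made for illustration and may not be exhaustive.
CODECS = ["shift_jisx0213", "shift_jis_2004", "euc_jis_2004"]

if __name__ == "__main__":
    for codec in CODECS:
        # U+00E6 followed by a null character is the case shown above.
        ok = roundtrips("\u00e6\0abc", codec)
        print(f"{codec}: {'ok' if ok else 'null character lost'}")
```

If the null character is dropped during encoding, the decoded string is one character shorter than the input and the check reports a loss.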

@StanFromIreland
Member Author

> euc_jis_2004 has the same logic. Would you update it too?

Done.

@methane methane changed the title from "gh-101828: Fix shift_jisx0213 & shift_jis_2004 codecs removing null characters" to "gh-101828: Fix jisx0213 codecs removing null characters" on Oct 7, 2025
@methane
Member

methane commented Oct 7, 2025

iso2022_jp_3 and iso2022_jp_2004 have the same issue.
Would you add this patch?

diff --git a/Modules/cjkcodecs/_codecs_iso2022.c b/Modules/cjkcodecs/_codecs_iso2022.c
index ef6faeb7127..83afdd0a1ee 100644
--- a/Modules/cjkcodecs/_codecs_iso2022.c
+++ b/Modules/cjkcodecs/_codecs_iso2022.c
@@ -802,10 +802,12 @@ jisx0213_encoder(const MultibyteCodec *codec, const Py_UCS4 *data,
         return coded;

     case 2: /* second character of unicode pair */
-        coded = find_pairencmap((ucs2_t)data[0], (ucs2_t)data[1],
-                                jisx0213_pair_encmap, JISX0213_ENCPAIRS);
-        if (coded != DBCINV)
-            return coded;
+        if (data[1] != 0) { /* Don't consume null char as part of pair */
+            coded = find_pairencmap((ucs2_t)data[0], (ucs2_t)data[1],
+                                    jisx0213_pair_encmap, JISX0213_ENCPAIRS);
+            if (coded != DBCINV)
+                return coded;
+        }
         _Py_FALLTHROUGH;
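
For readers less familiar with the C code, the effect of the guard can be modeled in a few lines of Python. This is a toy sketch and not the CPython implementation: it assumes the pair table holds an entry with a zero second element for the standalone character, and `PAIR_MAP`, `encode_pair`, and the placeholder values are all hypothetical.

```python
# Toy model of the pair lookup; NOT the CPython implementation.
# Keys are (first, second) code points. A second element of 0 is assumed
# to stand for "first character on its own". Values are placeholders.
PAIR_MAP = {
    (0x00E6, 0x0000): "ae-alone",
    (0x00E6, 0x0300): "ae-with-grave",
}


def encode_pair(first, second, guard):
    """Return (code, consumed) for a character that may start a pair."""
    # With the guard enabled, a null second character skips the pair lookup.
    if not (guard and second == 0):
        code = PAIR_MAP.get((first, second))
        if code is not None:
            return code, 2  # both input characters are consumed
    # Fall through: encode the first character on its own.
    return PAIR_MAP[(first, 0)], 1
```

In this model, `encode_pair(0x00E6, 0, guard=False)` matches the standalone entry while consuming two characters, which is how the null character would disappear. With `guard=True` only one character is consumed and the null is left for the next step.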

Labels
awaiting merge, needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes)
3 participants