Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

y0n0zawa · 2025-07-01T03:30:26Z

When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters,
etc.), the plugin corrupts these characters if they happen to fall exactly at the 8192-byte (8KB)
boundary in the data stream between the Node.js plugin and Ruby server.

This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was
changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8
boundary issue where multibyte characters could be split across chunk boundaries.

Reproduction

# Create a file with exactly 8188 ASCII characters followed by a multibyte character
puts "#{'a' * 8188}😀"

The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.

Solution

Implemented a length-prefixed protocol for socket communication:

Client sends a 4-byte length header before the JSON content
Server reads the exact number of bytes specified in the header
This ensures complete UTF-8 strings are decoded regardless of chunking

Testing

Added comprehensive test coverage:

Test case that reproduces the exact 8KB boundary issue
Verified with various multibyte characters (emojis, Japanese characters)
Ensures the fix works across different buffer sizes

Impact

Fixes data corruption for users working with non-ASCII content
No breaking changes to the API
Minimal performance impact (4-byte overhead per request)

yaa · 2025-07-20T05:30:51Z

I also encountered this issue around the same time and had been working on a fix locally.
This fix appears to have a smaller impact compared to mine. What do you think?

diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {

   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');
     let chunks = "";

     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });

     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });

     socket.on("end", () => {

y0n0zawa added 2 commits July 1, 2025 12:21

Add test to check UTF-8 garbling at 8KB boundary

b086fc3

Fix UTF-8 character corruption at 8KB buffer boundaries

b64beee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

y0n0zawa commented Jul 1, 2025

Uh oh!

yaa commented Jul 20, 2025

Uh oh!

Uh oh!

Uh oh!

Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

Are you sure you want to change the base?

Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

Conversation

y0n0zawa commented Jul 1, 2025

Reproduction

Solution

Testing

Impact

Uh oh!

yaa commented Jul 20, 2025

Uh oh!

Uh oh!