Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461

Open: 2 commits to be merged into base: main

@y0n0zawa y0n0zawa commented Jul 1, 2025

When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters,
etc.), the plugin corrupts any such character whose bytes straddle the 8192-byte (8 KB)
boundary in the data stream between the Node.js plugin and the Ruby server.

This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was
changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8
boundary issue where multibyte characters could be split across chunk boundaries.

Reproduction

# Create a file with exactly 8188 ASCII characters followed by a multibyte character
puts "#{'a' * 8188}😀"

The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.
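
The failure mode can be reproduced without a socket. The following is a minimal sketch, not code from this PR: it assumes the stream is delivered in 8192-byte chunks and that each chunk is decoded on its own, as the pre-fix `data.toString("utf-8")` call does. The offsets are illustrative and chosen so the emoji straddles the boundary; the real offsets in the plugin's stream depend on the surrounding payload.

```javascript
const CHUNK_SIZE = 8192; // boundary size reported in this PR

// 8190 ASCII bytes, then a 4-byte emoji at offsets 8190..8193,
// so the emoji straddles the CHUNK_SIZE boundary (illustrative offsets).
const original = "a".repeat(8190) + "😀";
const bytes = Buffer.from(original, "utf-8");

// Simulate the socket delivering the payload as two chunks.
const chunks = [bytes.subarray(0, CHUNK_SIZE), bytes.subarray(CHUNK_SIZE)];

// Decoding each chunk separately: each half of the emoji is invalid UTF-8
// on its own and is replaced with U+FFFD.
let perChunk = "";
for (const chunk of chunks) {
  perChunk += chunk.toString("utf-8");
}

// Decoding only after all bytes are joined keeps the character intact.
const joined = Buffer.concat(chunks).toString("utf-8");

console.log(perChunk === original); // false
console.log(perChunk.includes("\uFFFD")); // true
console.log(joined === original); // true
```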

Solution

Implemented a length-prefixed protocol for socket communication:

  1. Client sends a 4-byte length header before the JSON content
  2. Server reads the exact number of bytes specified in the header
  3. This ensures complete UTF-8 strings are decoded regardless of chunking
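
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the header is assumed to be an unsigned 32-bit big-endian integer, both sides are written in JavaScript for brevity (the actual server is Ruby), and `encodeFrame` and `FrameReader` are hypothetical names.

```javascript
const HEADER_BYTES = 4; // header size taken from the description above

// Prefix a JSON payload with its byte length (big-endian is an assumption).
function encodeFrame(value) {
  const body = Buffer.from(JSON.stringify(value), "utf-8");
  const header = Buffer.alloc(HEADER_BYTES);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Hypothetical reader: collects raw chunks and decodes only complete frames,
// so a multibyte character is never decoded from a partial byte sequence.
class FrameReader {
  constructor() {
    this.pending = Buffer.alloc(0);
  }

  // Feed one chunk; returns the values of every frame completed by it.
  push(chunk) {
    this.pending = Buffer.concat([this.pending, chunk]);
    const values = [];
    while (this.pending.length >= HEADER_BYTES) {
      const length = this.pending.readUInt32BE(0);
      const end = HEADER_BYTES + length;
      if (this.pending.length < end) break; // body still incomplete
      const body = this.pending.subarray(HEADER_BYTES, end);
      values.push(JSON.parse(body.toString("utf-8")));
      this.pending = this.pending.subarray(end);
    }
    return values;
  }
}

// Usage: split one frame in the middle of the emoji's 4 bytes.
const frame = encodeFrame({ source: "a".repeat(10) + "😀" });
const reader = new FrameReader();
const cut = frame.length - 4; // falls between the emoji's 2nd and 3rd byte
const first = reader.push(frame.subarray(0, cut));
const second = reader.push(frame.subarray(cut));

console.log(first.length); // 0
console.log(second[0].source.endsWith("😀")); // true
```

The length counts bytes, not characters, which is why the body is measured after encoding to a `Buffer` rather than with `String.prototype.length`.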

Testing

Added comprehensive test coverage:

  • Test case that reproduces the exact 8KB boundary issue
  • Verified with various multibyte characters (emojis, Japanese characters)
  • Ensures the fix works across different buffer sizes

Impact

  • Fixes data corruption for users working with non-ASCII content
  • No breaking changes to the API
  • Minimal performance impact (4-byte overhead per request)

yaa commented Jul 20, 2025

I also encountered this issue around the same time and had been working on a fix locally.
This fix appears to have a smaller impact compared to mine. What do you think?

diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {

   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');
     let chunks = "";

     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });

     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });

     socket.on("end", () => {
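
For context on why the one-line change above can work: Node's stream documentation states that `setEncoding` makes a readable stream handle multi-byte characters that arrive split across chunks. The sketch below illustrates that buffering behaviour with the standalone `string_decoder` module; it is not part of either proposed patch, and the sample text is arbitrary.

```javascript
const { StringDecoder } = require("node:string_decoder");

const text = "こんにちは😀";
const encoded = Buffer.from(text, "utf-8");

// Feed the decoder one byte at a time, the worst possible chunking.
// Incomplete sequences are held back until their remaining bytes arrive.
const decoder = new StringDecoder("utf-8");
let decoded = "";
for (const byte of encoded) {
  decoded += decoder.write(Buffer.from([byte]));
}
decoded += decoder.end(); // flush anything still buffered

console.log(decoded === text); // true
```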
