Optimize WebFlux multipart upload performance #35366

Open · wants to merge 4 commits into base: main
Conversation

@xyraclius commented Aug 21, 2025

🚀 Overview

This PR improves performance in WebFlux multipart upload processing by optimizing how AbstractNestedMatcher scans DataBuffer instances for delimiters.


🔥 Motivation

Multipart uploads in WebFlux are currently significantly slower than in Spring MVC, especially for large files. A major bottleneck is the delimiter matching logic, which processes buffers one byte at a time and incurs unnecessary per-byte overhead.


🔧 Changes

  • Reworked AbstractNestedMatcher to use:
    • A reusable scratch buffer to avoid per-call allocations: initially a
      thread-local buffer (LOCAL_BUFFER), later replaced with an
      instance-local buffer (localBuffer) to simplify buffer management.
    • Chunk-based scanning instead of byte-by-byte iteration.
  • Extracted helper methods (processChunk, findNextCandidate,
    updateMatchIndex) to reduce complexity and improve readability.
  • Preserved existing API and behavior for backward compatibility.
  • Added clear Javadoc to explain usage and limitations.
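The chunk-based scanning idea can be sketched as follows. This is a simplified, hypothetical illustration (the class and method names below are made up for this sketch, not the actual AbstractNestedMatcher code): rather than feeding the matcher one byte at a time, the scan skips ahead to the next occurrence of the delimiter's first byte and only then compares the full delimiter against the chunk.

```java
import java.util.Arrays;

// Simplified sketch of chunk-based delimiter matching (hypothetical,
// not the actual Spring AbstractNestedMatcher implementation).
public class ChunkScanSketch {

    // Returns the index just past the delimiter within chunk[0..length),
    // or -1 if the delimiter does not occur.
    static int indexAfterDelimiter(byte[] chunk, int length, byte[] delimiter) {
        byte first = delimiter[0];
        for (int i = 0; i <= length - delimiter.length; i++) {
            // Skip cheaply to the next candidate position instead of
            // running the full state machine on every byte.
            if (chunk[i] != first) {
                continue;
            }
            if (Arrays.equals(chunk, i, i + delimiter.length,
                    delimiter, 0, delimiter.length)) {
                return i + delimiter.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] delimiter = "\r\n--boundary".getBytes();
        byte[] data = "part one\r\n--boundarypart two".getBytes();
        // Delimiter starts at index 8 and is 12 bytes long.
        System.out.println(indexAfterDelimiter(data, data.length, delimiter)); // 20
    }
}
```

A real implementation also has to handle a delimiter split across two DataBuffer boundaries, which is where the retained match index in the actual matcher comes in.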

✅ Benefits

  • Reduced allocation overhead during large multipart uploads.
  • Improved throughput when scanning large DataBuffer instances.
  • Better maintainability through clearer structure and documentation.
  • No breaking changes — existing API and behavior remain intact.
  • No new tests required — code is already fully covered (100%) by
    existing unit tests.

📈 Performance Impact

Before this change, uploading a 200MB file took:

  • MultipartFile (Spring MVC): ~700 ms
  • FilePart (WebFlux): ~4.1 s
  • PartEvent (WebFlux): ~4.2 s

After this change, large multipart uploads in WebFlux no longer suffer from excessive overhead in delimiter scanning.

📊 Benchmark Results

Before Optimization

curl -w "Elapsed: %{time_total}s\n" \
  -F file="@200mb.pdf" \
  http://localhost:8080/file-part \
  -o /dev/null -s
# Elapsed: 4.177509s

curl -w "Elapsed: %{time_total}s\n" \
  -F file="@200mb.pdf" \
  http://localhost:8080/part-event \
  -o /dev/null -s
# Elapsed: 4.224732s

After Optimization

curl -w "Elapsed: %{time_total}s\n" \
  -F file="@200mb.pdf" \
  http://localhost:8080/file-part \
  -o /dev/null -s
# Elapsed: 0.437359s

curl -w "Elapsed: %{time_total}s\n" \
  -F file="@200mb.pdf" \
  http://localhost:8080/part-event \
  -o /dev/null -s
# Elapsed: 0.317705s


# Test: 19.53GB file
curl -w "\nElapsed: %{time_total}s\n" \
  -F file="@game.rar" \
  http://localhost:8080/file-part \
  -o /dev/null -s
# Elapsed: 27.586772s

curl -w "Elapsed: %{time_total}s\n" \
     -F file="@game.rar" \
     http://localhost:8080/part-event \
     -o /dev/null -s
# Elapsed: 18.328275s


# Test: multiple files
curl -w "Elapsed: %{time_total}s\n" \
  -F file="@200mb.pdf" \
  -F file="@game.rar" \
  http://localhost:8080/file-part \
  -o /dev/null -s
# Elapsed: 23.678042s

curl -w "Elapsed: %{time_total}s\n" \
     -F file="@200mb.pdf" \
     -F file="@game.rar" \
     http://localhost:8080/part-event \
     -o /dev/null -s
# Elapsed: 17.001363s

Example

@PostMapping(path = "/file-part", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public Mono<Void> uploadFilePart(@RequestPart("file") FilePart file) {
    log.info("Received file part request {}", file.filename());
    return Mono.empty();
}

@PostMapping(path = "/part-event", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public Mono<Void> uploadPartEvent(@RequestBody Flux<PartEvent> allPartsEvents) {
    return allPartsEvents.windowUntil(PartEvent::isLast)
        .concatMap(p -> p.switchOnFirst((signal, partEvents) -> {
            if (signal.hasValue()) {
                PartEvent event = signal.get();
                if (event instanceof FormPartEvent formEvent) {
                    return Mono.error(new RuntimeException("Unexpected event: " + event));
                } else if (event instanceof FilePartEvent fileEvent) {
                    log.info("Received part event request {}", fileEvent.filename());
                    return partEvents
                            .ofType(FilePartEvent.class)
                            .doOnNext(file -> DataBufferUtils.release(file.content()))
                            .then(Mono.empty());
                } else {
                    return Mono.error(new RuntimeException("Unexpected event: " + event));
                }
            } else {
                return partEvents;
            }
        })).then();
}

Related Issue

Closes gh-34651


Checklist

Improve AbstractNestedMatcher by using a thread-local buffer and chunked
scanning to reduce allocations and speed up multipart boundary detection.

Closes gh-34651

Signed-off-by: Nabil Fawwaz Elqayyim <[email protected]>
@spring-projects-issues added the status: waiting-for-triage label on Aug 21, 2025
@bclozel (Member) commented Aug 21, 2025

Thanks for raising this @xyraclius, but I'm not sure the approach is valid.

For WebFlux applications, there is no assumption about the processing of a single request. Unlike Servlet applications where the processing of a request is happening on a single thread, reactive apps can schedule work on many different threads. Isn't using a ThreadLocal likely to create concurrency issues if many multipart requests happen in parallel?
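One facet of this concern can be shown with a minimal sketch, using plain java.util.concurrent rather than Reactor (the class and method names are illustrative): a buffer stored in a ThreadLocal on one thread is simply not the same buffer when the same logical request resumes on another thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of why ThreadLocal state is risky when one logical
// request hops between threads, as reactive requests can.
public class ThreadLocalHazard {

    static final ThreadLocal<byte[]> LOCAL_BUFFER =
            ThreadLocal.withInitial(() -> new byte[8]);

    // Returns true only if a pool thread sees the caller's buffer.
    static boolean sameBufferAcrossThreads() throws Exception {
        byte[] onCallerThread = LOCAL_BUFFER.get();
        onCallerThread[0] = 42; // state written while "processing" on thread A

        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The continuation of the same logical request runs on thread B
            // and gets a different, freshly initialized buffer.
            byte[] onPoolThread = pool.submit(LOCAL_BUFFER::get).get();
            return onCallerThread == onPoolThread;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sameBufferAcrossThreads()); // prints false
    }
}
```

The converse is the other half of the hazard: two unrelated requests scheduled onto the same event-loop thread would share one ThreadLocal buffer mid-scan.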

@bclozel added the status: waiting-for-feedback label on Aug 21, 2025
@xyraclius (Author)
Hi @bclozel,

I initially used ThreadLocal to reduce per-call allocations and improve CPU/memory usage. That said, I now understand that in WebFlux a single request can run across multiple threads, so using ThreadLocal could be unsafe.

The safest approach would be to switch to a per-request local buffer, like:

final byte[] chunk = new byte[8 * 1024];

This avoids any concurrency issues while keeping allocations reasonable. I can implement this change and test the performance to make sure we maintain the improvements.
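The per-instance approach might look roughly like this hypothetical sketch (DelimiterMatcher and countDelimiters are illustrative names, not the actual Spring code). Assuming, as the discussion suggests, one matcher instance per request, an instance field buffer is reused across scans without ever being shared between concurrent requests.

```java
// Hypothetical sketch of the per-instance buffer approach.
public class PerInstanceBufferSketch {

    // Illustrative matcher: one instance per request, so the reusable
    // buffer never escapes to another concurrent request.
    static class DelimiterMatcher {
        private final byte[] chunk = new byte[8 * 1024];
        private final byte[] delimiter;

        DelimiterMatcher(byte[] delimiter) {
            this.delimiter = delimiter;
        }

        // Copies a slice of the incoming buffer into the reusable chunk
        // and counts delimiter occurrences (a stand-in for real matching).
        int countDelimiters(byte[] incoming) {
            int hits = 0;
            int copied = Math.min(incoming.length, chunk.length);
            System.arraycopy(incoming, 0, chunk, 0, copied);
            for (int i = 0; i <= copied - delimiter.length; i++) {
                boolean match = true;
                for (int j = 0; j < delimiter.length; j++) {
                    if (chunk[i + j] != delimiter[j]) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    hits++;
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        DelimiterMatcher matcher = new DelimiterMatcher("--b".getBytes());
        System.out.println(matcher.countDelimiters("a--bc--bd".getBytes())); // 2
    }
}
```

Compared with allocating a fresh array per call, this keeps allocation constant per request at the cost of one array per matcher instance.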

@spring-projects-issues added the status: feedback-provided label and removed the status: waiting-for-feedback label on Aug 21, 2025
@bclozel added the status: waiting-for-feedback label and removed the status: feedback-provided label on Aug 21, 2025
@sdeleuze (Contributor)

Yeah please refine accordingly and we will review it.

@xyraclius (Author)

Hi @bclozel @sdeleuze,
Refined the change to use a per-instance buffer instead of ThreadLocal, as suggested.
✅ Verified correctness and perf still look good.

Please let me know if further adjustments are needed. Thanks for the guidance! 🙏

@spring-projects-issues added the status: feedback-provided label and removed the status: waiting-for-feedback label on Aug 22, 2025
- Replace ThreadLocal buffer with a per-instance reusable buffer
- Improves memory locality and reduces ThreadLocal overhead
- Update Javadoc for clarity, performance notes, and subclassing guidance

Closes gh-34651

Signed-off-by: Nabil Fawwaz Elqayyim <[email protected]>