
Conversation

Collaborator

@NuojCheng NuojCheng commented Nov 12, 2025

Description

This PR updates the training loop to align with MaxDiffusion patterns, improving the stability of TFLOP/s/device reporting and eliminating metric outliers.

Motivation

Previously, reported step times were heavily influenced by two factors, leading to noisy benchmarking results:

  1. Profiler Overhead: The device synchronization required for clean profiling (blocking until all pending work completes) caused large step-time outliers at the steps where the profiler was activated or deactivated (see the sketch after this list).
  2. Synchronous I/O: Data loading on the main thread introduced latency spikes that were misattributed to model performance.
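For the first point, here is a minimal sketch of keeping profiler activation outside the timed window; train_step, profile_dir, and the step bounds are hypothetical names for illustration, not the PR's actual code:

import time

import jax

def run_with_profile(train_step, state, batches, profile_dir, start_step, stop_step):
  # Sketch: start/stop the profiler outside the timed region so the
  # device sync it requires never shows up as a step-time outlier.
  for step, batch in enumerate(batches):
    if step == start_step:
      jax.block_until_ready(state)           # drain pending device work first
      jax.profiler.start_trace(profile_dir)  # sync cost lands outside the timer

    t0 = time.perf_counter()
    state = train_step(state, batch)
    jax.block_until_ready(state)             # measure only this step's compute
    print(f"step {step}: {time.perf_counter() - t0:.4f}s")

    if step == stop_step:
      jax.profiler.stop_trace()
  return state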

Changes

  1. Asynchronous Data Loading: A ThreadPoolExecutor now prefetches the next batch in a background thread, so data loading overlaps with device computation.
  2. Isolated Step Timing: The timing logic now tracks p_train_step execution strictly, via an explicit block_until_ready() call on the loss metric (see the combined sketch after this list).
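A minimal sketch of how the two changes can fit together; p_train_step, the data iterator, and the metrics layout here are assumptions for illustration, not the PR's actual code:

import time
from concurrent.futures import ThreadPoolExecutor

import jax

def train_loop(p_train_step, state, data_iter, nextrng, num_steps):
  # Sketch: a single worker thread prefetches batches while the device computes.
  executor = ThreadPoolExecutor(max_workers=1)
  future = executor.submit(next, data_iter)    # prefetch the first batch
  for _ in range(num_steps):
    example_batch = future.result()            # usually already loaded
    future = executor.submit(next, data_iter)  # overlap the next load with compute

    start = time.perf_counter()
    state, metrics = p_train_step(state, example_batch, nextrng)
    # Block on the loss so the timer covers only p_train_step execution.
    jax.block_until_ready(metrics["loss"])
    print(f"step time: {time.perf_counter() - start:.4f}s")
  executor.shutdown()
  return state

A single worker keeps batches arriving in order, and future.result() only blocks when loading is slower than the step itself.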

Tests

Example log difference

Checklist

Before submitting this PR, please make sure (put an X in the square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments to my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@NuojCheng NuojCheng added the draft Draft PR label Nov 12, 2025
@NuojCheng NuojCheng force-pushed the chengnuojin-thread-profiler branch 3 times, most recently from 150872d to a22c9fd on November 12, 2025 17:25
@NuojCheng NuojCheng force-pushed the chengnuojin-thread-profiler branch from a22c9fd to 47163f7 on November 12, 2025 17:56
@NuojCheng NuojCheng changed the title Add subprocess on profiler and dataloader to remove outlier steps Async Data Loading & Accurate Step Timing to Exclude Profiler Overhead Nov 12, 2025
@github-actions

🤖 Hi @NuojCheng, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions bot left a comment


📋 Review Summary

This pull request effectively addresses performance metric inaccuracies by introducing asynchronous data loading and refining step timing. The use of ThreadPoolExecutor for prefetching batches is a solid approach to overlap I/O with computation. The core logic changes are sound and well-motivated.

🔍 General Feedback

  • The change to isolate p_train_step for timing purposes by using jax.block_until_ready() is a significant improvement for accurate benchmarking.
  • The code is well-structured, and the addition of asynchronous loading is handled cleanly.
  • One minor suggestion was made to add a comment for clarity.

Overall, this is a valuable enhancement for performance analysis.

jax.block_until_ready(state)  # Ensure compilation has finished.
gcs_utils.upload_dump(
    config.dump_hlo_local_dir,
    config.dump_hlo_gcs_dir,

🟢 To enhance clarity for future maintainers, it would be beneficial to add a comment explaining why jax.block_until_ready(state) is called here: it ensures that all asynchronous device operations for the current training step are complete before the step time is recorded, guaranteeing an accurate measurement of the true execution time.

Suggested change:

state, metrics = p_train_step(state, example_batch, nextrng)
# Block to get accurate step time
jax.block_until_ready(state)
