
Conversation

@Rohan-Bierneni (Collaborator) commented Nov 12, 2025

Description

This PR is the final one needed to fully support the Qwen3 Next model in MaxText. It adds the conversion scripts from Hugging Face and verifies that the logits match between the HF and MaxText models (a sketch of this comparison is included below for context).

If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
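
For context, here is a minimal sketch of the kind of forward-pass logits comparison these checkers perform. The helper names (run_hf_forward, run_maxtext_forward) and the tolerance are illustrative assumptions, not the actual MaxText forward_pass_logit_checker implementation:

import numpy as np

def compare_logits(prompt_ids, atol=1e-2):
  # Hypothetical helpers: run the same token ids through the HF reference
  # model and the converted MaxText model, each returning (seq_len, vocab) logits.
  hf_logits = run_hf_forward(prompt_ids)
  mt_logits = run_maxtext_forward(prompt_ids)
  max_diff = np.max(np.abs(hf_logits - mt_logits))
  top1_match = np.mean(np.argmax(hf_logits, -1) == np.argmax(mt_logits, -1))
  print(f"max |logit diff| = {max_diff:.4f}, top-1 agreement = {top1_match:.2%}")
  assert top1_match == 1.0 or max_diff < atol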

Tests

(Unscanned) Forward pass logit checker: https://paste.googleplex.com/6146326802857984

(Scanned) Forward pass logit checker: https://paste.googleplex.com/6195553369194496

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

add debug statements

Conversion script ran without failing

test verify orbax hf tensors

Add unscanned conversion script for qwen3 next

Move gating op to after sharding optimizations

added zero centered rmsnorm

Add layer-by-layer comparison script

Remove debug files
Comment on lines 676 to 683
# llama_or_mistral_ckpt.save_weights_to_checkpoint(
# args.maxtext_model_path,
# jax_weights,
# args.simulated_cpu_devices_count,
# args.use_ocdbt,
# args.use_zarr3,
# )
# max_logging.log("Checkpoint saved successfully.")
Collaborator

Remove unused code?


model_params = MODEL_PARAMS_DICT[args.model_size]
max_logging.log(f"Starting conversion for Qwen3-Next model size: {args.model_size}")
# jax_weights = convert_hf_to_maxtext(args.base_model_path, model_params)
Collaborator

Is this not required?

}


def verify_conversion(maxtext_weights: Dict[str, Any], chkpt_vars: Dict[str, torch.Tensor], model_params: Dict[str, Any]):
@parambole (Collaborator) commented Nov 12, 2025

Can you elaborate on this?

@github-actions

🤖 Hi @Rohan-Bierneni, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions bot left a comment

📋 Review Summary

This Pull Request introduces comprehensive support for the Qwen3-Next model, including both scanned and unscanned checkpoint conversion scripts. The integration of heterogeneous layers and the new configuration validation are positive additions, demonstrating a thoughtful approach to supporting this new model.

🔍 General Feedback

  • The overall structure for Qwen3-Next integration appears well-designed, particularly the handling of alternating Gated Delta Net and Gated Attention layers.
  • The addition of configuration validation for gdn_num_value_heads is a good practice (a sketch of such a check is shown after this list).
  • There are a few areas identified for potential improvement in terms of code clarity, naming conventions, and a critical logic change in the attention mechanism that warrants further review and verification.
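
A minimal sketch of the kind of gdn_num_value_heads validation described above; the field names and the divisibility constraint are assumptions and may differ from the actual check added in this PR:

def validate_gdn_heads(gdn_num_value_heads, gdn_num_key_heads):
  # Assumed constraint: value heads are grouped per key head, so the value
  # head count must be an exact multiple of the key head count.
  if gdn_num_value_heads % gdn_num_key_heads != 0:
    raise ValueError(
        f"gdn_num_value_heads ({gdn_num_value_heads}) must be divisible by "
        f"gdn_num_key_heads ({gdn_num_key_heads})."
    )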

if self.is_qwen3_next:
out = out.reshape(batch_size, seq_len, self.config.num_query_heads * self.config.head_dim)
out = out * jax.nn.sigmoid(gate)
out = self.out_projection(out, out_sharding=out_sharding)


🟠 High - The if self.is_qwen3_next: block has been moved to after the sharding logic. This means the reshaping and sigmoid gating for Qwen3-Next will now occur after the output has potentially been sharded. This could lead to incorrect behavior if the sharding expects a different shape or if the reshape/gating needs to happen before sharding. Please verify if this change is intentional and correct, or if the block should remain before the sharding logic.

Suggested change
out = self.out_projection(out, out_sharding=out_sharding)
if self.is_qwen3_next:
out = out.reshape(batch_size, seq_len, self.config.num_query_heads * self.config.head_dim)
out = out * jax.nn.sigmoid(gate)
if model_mode == MODEL_MODE_PREFILL:
out = self._maybe_shard_with_logical(out, self.prefill_out_axis_names)
elif model_mode == MODEL_MODE_TRAIN and self.config.expert_shard_attention_option == EP_AS_CONTEXT:
out = self._maybe_shard_with_logical(out, self.out_axis_names)
else:
out = self._maybe_shard_with_logical(out, self.decode_out_axis_names)

# limitations under the License.

""""Module for decoder layers."""
""" "Module for decoder layers."""


🟢 Low - Minor formatting: remove the extra space after the opening triple quotes in the docstring.

Suggested change
""" "Module for decoder layers."""
"""Module for decoder layers."""

nn.logical_to_mesh_axes(
(
"activation_batch",
"activation_length_no_exp",


🟢 Low - The removal of extra parentheses around the tuple in nn.logical_to_mesh_axes is a stylistic improvement for readability.

Suggested change
"activation_length_no_exp",
nn.logical_to_mesh_axes(
(
"activation_batch",
"activation_length_no_exp",
"activation_embed",
)
),

"""Applies final normalization and projects hidden states to logits."""

cfg = self.config
if cfg.shard_mode == ShardMode.EXPLICIT:


🟢 Low - The removal of extra parentheses around the tuple in nn.logical_to_mesh_axes is a stylistic improvement for readability.

Suggested change
if cfg.shard_mode == ShardMode.EXPLICIT:
nn.logical_to_mesh_axes((
"activation_batch",
"activation_length_no_exp",
"activation_embed",
)),

nn.logical_to_mesh_axes(
(
"activation_embed_and_logits_batch",
"activation_length_no_exp",


🟢 Low - The removal of extra parentheses around the tuple in nn.logical_to_mesh_axes is a stylistic improvement for readability.

Suggested change
"activation_length_no_exp",
nn.logical_to_mesh_axes(
(
"activation_embed_and_logits_batch",
"activation_length_no_exp",
"activation_vocab",
)
),

out_sharding = NamedSharding(self.mesh, nn.logical_to_mesh_axes((None, None, "activation_vocab")))
else:
out_sharding = NamedSharding(
self.mesh,


🟢 Low - The removal of extra parentheses around the tuple in nn.logical_to_mesh_axes is a stylistic improvement for readability.

Suggested change
self.mesh,
nn.logical_to_mesh_axes((
"activation_embed_and_logits_batch",
"activation_length_no_exp",
"activation_vocab",
)),

)

def __call__(self, hidden_states: Array) -> Array:
# hidden_states: (B, S, E)


🟡 Medium - The reshaping and splitting logic for qkvz and ba in the __call__ method is quite complex. While the comments are helpful, consider encapsulating some of this logic into smaller, well-named helper functions; a rough sketch of such a helper follows the suggestion below. This could improve readability, maintainability, and potentially reusability if similar patterns are used elsewhere.

Suggested change
# hidden_states: (B, S, E)
# STEP A: Input Projections
# hidden_states: (B, S, E)
qkvz = self.in_proj_qkvz(hidden_states)
ba = self.in_proj_ba(hidden_states)
query, key, value, z, b, a = self._split_and_reshape_qkvz_ba(batch, seq_len, qkvz, ba)
# Flatten head dimensions for concatenation before conv
q = query.reshape(batch, seq_len, -1)
k = key.reshape(batch, seq_len, -1)
v = value.reshape(batch, seq_len, -1)
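
For illustration only, a sketch of what such a helper could look like. The config field names (gdn_num_key_heads, gdn_head_k_dim, gdn_num_value_heads, gdn_head_v_dim) and the flat [Q | K | V | Z] / [B | A] layouts are assumptions; the actual projection layout may interleave these per key head, so this shows only the encapsulation, not the exact split order:

import jax.numpy as jnp

def _split_and_reshape_qkvz_ba(self, batch, seq_len, qkvz, ba):
  """Splits the fused projections into per-head query/key/value/z and b/a."""
  cfg = self.config
  k_dim = cfg.gdn_num_key_heads * cfg.gdn_head_k_dim    # assumed field names
  v_dim = cfg.gdn_num_value_heads * cfg.gdn_head_v_dim  # assumed field names
  # Assumed flat layout: qkvz is (B, S, 2 * k_dim + 2 * v_dim).
  q, k, v, z = jnp.split(qkvz, [k_dim, 2 * k_dim, 2 * k_dim + v_dim], axis=-1)
  # Assumed flat layout: ba is (B, S, 2 * num_value_heads).
  b, a = jnp.split(ba, 2, axis=-1)
  query = q.reshape(batch, seq_len, cfg.gdn_num_key_heads, cfg.gdn_head_k_dim)
  key = k.reshape(batch, seq_len, cfg.gdn_num_key_heads, cfg.gdn_head_k_dim)
  value = v.reshape(batch, seq_len, cfg.gdn_num_value_heads, cfg.gdn_head_v_dim)
  z = z.reshape(batch, seq_len, cfg.gdn_num_value_heads, cfg.gdn_head_v_dim)
  return query, key, value, z, b, a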

keys: the raw config in dict form
"""
if keys["sparse_matmul"]:


🟡 Medium - The sparse_matmul check for Qwen3-Next has been removed. Please clarify if sparse_matmul is now supported for Qwen3-Next, or if the dense path is always intended for this model. If it's the latter, a comment explaining this decision would be beneficial for future maintainers.
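
If the intent is that Qwen3-Next always takes the dense path, an explicit guard along these lines would document that decision. The key names and the restriction itself are assumptions pending the author's clarification:

# Illustrative only; assumes the raw config dict exposes "sparse_matmul"
# and "decoder_block" keys, and that sparse matmul is unsupported here.
if keys["sparse_matmul"] and keys.get("decoder_block") == "qwen3_next":
  raise ValueError(
      "sparse_matmul is not supported for qwen3_next; use the dense MoE path (sparse_matmul=False)."
  )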



def create_scanned_layer_pytree(layer_idx) -> Dict[str, Any]:
"""Creates the nested dictionary for one scanned layer."""


🟡 Medium - The function create_scanned_layer_pytree is used in the unscanned conversion script. This name is misleading as it suggests it's creating a structure for scanned layers. Please rename it to create_unscanned_layer_pytree or a more general name like create_layer_pytree_structure to accurately reflect its purpose in this context.
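
For clarity, the rename could look roughly like this (body elided; only the name and docstring change):

def create_unscanned_layer_pytree(layer_idx) -> Dict[str, Any]:
  """Creates the nested dictionary for one unscanned layer."""
  ...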

