
Add left pad for CausalLMPreprocessor #2343


Closed
wants to merge 2 commits

Conversation

pass-lin
Contributor

@pass-lin pass-lin commented Jul 25, 2025

We need to gradually add left padding support to CausalLM.
First step: #2242
Now, the second step is to add left padding to the CausalLMPreprocessor.
Next, we will add left padding to CausalLM itself, which I think everyone here will appreciate.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @pass-lin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the capability for left padding within the CausalLMPreprocessor by adding a padding_side argument and implementing the necessary logic. This change is a foundational step towards enabling left padding in CausalLM models, providing greater flexibility for sequence processing, especially in generation tasks where left padding is often preferred.

Highlights

  • CausalLMPreprocessor now supports padding_side: The CausalLMPreprocessor class has been enhanced to accept a padding_side argument (either 'left' or 'right', defaulting to 'right') in its constructor and its call and generate_preprocess methods. This allows users to control the direction of padding for sequences.
  • Left padding implementation for CausalLMPreprocessor: A new logic path has been introduced within CausalLMPreprocessor's call and generate_preprocess methods to handle left padding. This involves a two-step padding process and leverages tf.xla.dynamic_update_slice for precise manipulation of padding masks. Left padding functionality will raise an ImportError if a TensorFlow installation with XLA is not available.
  • Padding utility layers updated: The call methods in MultiSegmentPacker and StartEndPacker (which are used internally by CausalLMPreprocessor) have been modified to accept and utilize a padding_side argument. This makes these lower-level padding utilities more flexible and enables the new left padding behavior.
  • Comprehensive test coverage for padding: New unit tests have been added to causal_lm_preprocessor_test.py to thoroughly validate the behavior of CausalLMPreprocessor with both left and right padding for its standard preprocessing (call) and generation (generate_preprocess) functionalities.
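As a quick illustration of the proposed API, here is a minimal usage sketch. The task class and preset name are illustrative examples, not taken from this PR; any CausalLMPreprocessor subclass would behave the same way.

    import keras_hub

    # The Llama preset below is just an example choice.
    preprocessor = keras_hub.models.LlamaCausalLMPreprocessor.from_preset(
        "llama2_7b_en",
        sequence_length=16,
        padding_side="left",  # proposed argument; the default stays "right"
    )
    # Returns (x, y, sample_weight) with pad tokens on the left of each row.
    x, y, sample_weight = preprocessor(["the quick brown fox"])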

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces left-padding functionality to the CausalLMPreprocessor. The changes are logically sound, with the new padding_side option correctly propagated. The inclusion of tests is also commendable. My feedback focuses on improving the maintainability and performance of the new implementation by suggesting refactoring duplicated code into a helper method and using a more efficient API for tensor shape retrieval.

Comment on lines 109 to 132
if padding_side == "left":
    addition_token_num = int(self.add_start_token + self.add_end_token)
    token_ids, padding_mask = self.packer(
        x,
        sequence_length=x.to_tensor().shape[-1] + addition_token_num,
        add_start_value=self.add_start_token,
        add_end_value=self.add_end_token,
        padding_side=padding_side,
    )
    token_ids, all_padding_mask = self.packer(
        token_ids,
        sequence_length=sequence_length + 1,
        add_start_value=False,
        add_end_value=False,
        padding_side="right",
    )
    if dynamic_update_slice is None:
        raise ImportError(
            "Left padding on CausalLMPreprocessor requires a TensorFlow "
            "installation with XLA available."
        )
    padding_mask = dynamic_update_slice(
        all_padding_mask, padding_mask, [0] * len(padding_mask.shape)
    )


medium

The left padding logic in call is very similar to that in generate_preprocess (lines 174-195). This duplication increases maintenance overhead. Consider extracting this logic into a private helper method that can be called from both functions to improve code clarity and reduce redundancy.
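A rough sketch of one way to do the suggested extraction follows. The helper name `_pack_left_padded` and its exact signature are hypothetical, not from the PR:

    def _pack_left_padded(self, x, sequence_length):
        # Shared left-padding path for `call` and `generate_preprocess`.
        if dynamic_update_slice is None:
            raise ImportError(
                "Left padding on CausalLMPreprocessor requires a TensorFlow "
                "installation with XLA available."
            )
        addition_token_num = int(self.add_start_token + self.add_end_token)
        # Step 1: left-pad each row out to the longest sequence in the batch
        # (`x.bounding_shape()[-1]` would avoid densifying, per the other
        # review comment).
        token_ids, padding_mask = self.packer(
            x,
            sequence_length=x.to_tensor().shape[-1] + addition_token_num,
            add_start_value=self.add_start_token,
            add_end_value=self.add_end_token,
            padding_side="left",
        )
        # Step 2: right-pad to the final target length, then overlay the
        # left-padded mask onto the full-length mask.
        token_ids, all_padding_mask = self.packer(
            token_ids,
            sequence_length=sequence_length,
            add_start_value=False,
            add_end_value=False,
            padding_side="right",
        )
        padding_mask = dynamic_update_slice(
            all_padding_mask, padding_mask, [0] * len(padding_mask.shape)
        )
        return token_ids, padding_mask

`call` would then invoke this with `sequence_length + 1` (since it later shifts the packed ids to build labels), and `generate_preprocess` with its own target length.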

addition_token_num = int(self.add_start_token + self.add_end_token)
token_ids, padding_mask = self.packer(
    x,
    sequence_length=x.to_tensor().shape[-1] + addition_token_num,


medium

Using x.to_tensor().shape[-1] can be inefficient because it creates a dense tensor from a ragged one just to get its shape. A more performant way to get the length of the longest sequence in a tf.RaggedTensor is to use x.bounding_shape()[-1]. This avoids the overhead of creating the intermediate dense tensor. This optimization also applies to line 177 in generate_preprocess.

Suggested change
sequence_length=x.to_tensor().shape[-1] + addition_token_num,
sequence_length=x.bounding_shape()[-1] + addition_token_num,
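For context, a quick illustration of the difference between the two APIs (standard TensorFlow, not code from this PR):

    import tensorflow as tf

    x = tf.ragged.constant([[1, 2, 3], [4, 5]])
    # Densifies the ragged tensor just to read its shape.
    print(x.to_tensor().shape[-1])  # 3
    # Reads the bounding shape directly, with no dense intermediate.
    print(x.bounding_shape()[-1])   # tf.Tensor(3, shape=(), dtype=int64)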

@mattdangerw
Member

mattdangerw commented Jul 25, 2025

I think we've chatted before on this, but the hard part is the modeling and generation side here, not the preprocessing. Can we start with a prototype of everything working for one model, figure out our usage, then start landing?

My main worry is a good plan modeling side. How are we going to land positional embedding changes to accommodate this? What does the actual high level UX look like with generate? Not fully fleshed out, but if I was going to try to land this with a good UX...

  • Consider looking at optional inputs to functional models (optional=True on a keras.Input), so we could extend our backbones to take position_ids, attention_mask and other inputs that would be helpful in some cases but not always necessary (see the sketch after this comment).
  • Make sure our position embedding layers can take in a position ids input of some sort, and that our sampling APIs are general enough to handle left padding.
  • Update our preprocessing layers to handle left padding.
  • Rewrite our generate compiled functions to handle left padding (configurably or as a default? not sure). We'd need some ability to provide position ids to our backbones at this point.
  • Figure out a nice UX for all of this. Enable it by default for generation? A settable property on the preprocessor?

The hard stuff technically will be the positional inputs, sampling changes, and compiled generation changes, while keeping things relatively compatible. The preprocessing is fairly easy. I think a prototype with a model (fine to open a draft PR), say for Gemma 2 or Llama 3, would help us validate these harder areas first, before we start landing this. I'd avoid Gemma 3 for starters, it's extra hard with the need for bidirectional attention in image inputs.
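A minimal sketch of the optional-input idea from the first bullet above, assuming Keras 3's optional=True on keras.Input. The layer and its fallback behavior are hypothetical, for illustration only:

    import keras
    from keras import ops

    class PositionAwareEmbedding(keras.layers.Layer):
        def __init__(self, vocab_size, hidden_dim, max_length, **kwargs):
            super().__init__(**kwargs)
            self.token_embedding = keras.layers.Embedding(vocab_size, hidden_dim)
            self.position_embedding = keras.layers.Embedding(max_length, hidden_dim)

        def call(self, token_ids, position_ids=None):
            if position_ids is None:
                # Fall back to 0..T-1 when no explicit positions are passed.
                position_ids = ops.expand_dims(
                    ops.arange(ops.shape(token_ids)[1]), 0
                )
            return self.token_embedding(token_ids) + self.position_embedding(
                position_ids
            )

    token_ids = keras.Input(shape=(None,), dtype="int32", name="token_ids")
    position_ids = keras.Input(
        shape=(None,), dtype="int32", name="position_ids", optional=True
    )
    outputs = PositionAwareEmbedding(32000, 64, 512)(token_ids, position_ids)
    model = keras.Model([token_ids, position_ids], outputs)

At call time, passing None for position_ids would fall back to default positions, while explicit position_ids could account for left padding.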

@pass-lin
Contributor Author

pass-lin commented Jul 26, 2025

[quotes @mattdangerw's comment above in full]

You're right, but after thinking about it, I think we should start with Qwen or Llama, mainly for the following reasons.
First, both models are pre-trained with left padding, so we don't need to consider various complicated input arrangements.
Second, even treated as right-padding models, these two are the easiest to handle. Only two points need special attention. The first is the padding mask, which this PR implements. The second is the RoPE position, but the nature of RoPE guarantees that the value is the same for the same relative position.
We can explain why RoPE has this property as follows; this is the original derivation written by the author of RoPE.

[image: excerpt of the rotary position encoding formulation from the original RoPE paper]
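For reference, the property being invoked is the standard RoPE identity (as in Su et al.'s RoFormer paper; the notation below is a common textbook form, not text from this thread). Rotations compose as R_m^T R_n = R_{n-m}, so attention scores depend only on relative offsets:

    % Queries and keys are rotated by an angle proportional to their
    % absolute positions m and n; their inner product depends only on n - m.
    q_m = R_{\Theta,m} W_q x_m, \qquad k_n = R_{\Theta,n} W_k x_n
    \quad\Longrightarrow\quad
    q_m^{\top} k_n = x_m^{\top} W_q^{\top} R_{\Theta,\,n-m} W_k x_n

Shifting every position by a constant, as left padding does, therefore leaves all query–key scores unchanged.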

Therefore, we can set aside the more complex models and implement these two simpler cases first.

If you think this PR is too simple, I will implement left padding support for Qwen and Llama.

@mattdangerw
Member

@pass-lin fine to start with an end-to-end prototype for Llama or Qwen, and you are right that RoPE in both these models is position invariant. It's not that this PR is too simple; it's that I think we need to peer ahead at where we are going a bit. Qwen or Llama would allow us to figure out the user-facing experience and the changes to compiled generation that will be necessary. We do ultimately want a plan for models with absolute position embeddings (e.g. GPT-2), so that may be worth thinking about a bit, but it's fine to start with whatever model.

@pass-lin pass-lin closed this Aug 7, 2025