
Add left pad for CausalLMPreprocessor #2343


Closed
wants to merge 2 commits

Conversation

pass-lin
Contributor

@pass-lin pass-lin commented Jul 25, 2025

We need to gradually add left padding support to CausalLM.
First step: #2242
Now, the second step is to add left padding to the CausalLMPreprocessor.
Next, we will add left padding to CausalLM itself, which I think everyone here will appreciate.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @pass-lin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the capability for left padding within the CausalLMPreprocessor by adding a padding_side argument and implementing the necessary logic. This change is a foundational step towards enabling left padding in CausalLM models, providing greater flexibility for sequence processing, especially in generation tasks where left padding is often preferred.

Highlights

  • CausalLMPreprocessor now supports padding_side: The CausalLMPreprocessor class has been enhanced to accept a padding_side argument (either 'left' or 'right', defaulting to 'right') in its constructor and its call and generate_preprocess methods. This allows users to control the direction of padding for sequences.
  • Left padding implementation for CausalLMPreprocessor: A new logic path has been introduced within CausalLMPreprocessor's call and generate_preprocess methods to handle left padding. This involves a two-step padding process and leverages tf.xla.dynamic_update_slice for precise manipulation of padding masks. Left padding functionality will raise an ImportError if a TensorFlow installation with XLA is not available.
  • Padding utility layers updated: The call methods in MultiSegmentPacker and StartEndPacker (which are used internally by CausalLMPreprocessor) have been modified to accept and utilize a padding_side argument. This makes these lower-level padding utilities more flexible and enables the new left padding behavior.
  • Comprehensive test coverage for padding: New unit tests have been added to causal_lm_preprocessor_test.py to thoroughly validate the behavior of CausalLMPreprocessor with both left and right padding for its standard preprocessing (call) and generation (generate_preprocess) functionalities.
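As a quick illustration of the proposed API, here is a minimal usage sketch. The task class and preset name are illustrative examples, not taken from this PR; any CausalLMPreprocessor subclass would behave the same way.

    import keras_hub

    # The Llama preset below is just an example choice.
    preprocessor = keras_hub.models.LlamaCausalLMPreprocessor.from_preset(
        "llama2_7b_en",
        sequence_length=16,
        padding_side="left",  # proposed argument; the default stays "right"
    )
    # Returns (x, y, sample_weight) with pad tokens on the left of each row.
    x, y, sample_weight = preprocessor(["the quick brown fox"])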

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces left-padding functionality to the CausalLMPreprocessor. The changes are logically sound, with the new padding_side option correctly propagated. The inclusion of tests is also commendable. My feedback focuses on improving the maintainability and performance of the new implementation by suggesting refactoring duplicated code into a helper method and using a more efficient API for tensor shape retrieval.

Comment on lines 109 to 132
if padding_side == "left":
    addition_token_num = int(self.add_start_token + self.add_end_token)
    token_ids, padding_mask = self.packer(
        x,
        sequence_length=x.to_tensor().shape[-1] + addition_token_num,
        add_start_value=self.add_start_token,
        add_end_value=self.add_end_token,
        padding_side=padding_side,
    )
    token_ids, all_padding_mask = self.packer(
        token_ids,
        sequence_length=sequence_length + 1,
        add_start_value=False,
        add_end_value=False,
        padding_side="right",
    )
    if dynamic_update_slice is None:
        raise ImportError(
            "Left padding on CausalLMPreprocessor requires a TensorFlow "
            "installation with XLA available."
        )
    padding_mask = dynamic_update_slice(
        all_padding_mask, padding_mask, [0] * len(padding_mask.shape)
    )


medium

The left padding logic in call is very similar to that in generate_preprocess (lines 174-195). This duplication increases maintenance overhead. Consider extracting this logic into a private helper method that can be called from both functions to improve code clarity and reduce redundancy.
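A rough sketch of one way to do the suggested extraction follows. The helper name `_pack_left_padded` and its exact signature are hypothetical, not from the PR:

    def _pack_left_padded(self, x, sequence_length):
        # Shared left-padding path for `call` and `generate_preprocess`.
        if dynamic_update_slice is None:
            raise ImportError(
                "Left padding on CausalLMPreprocessor requires a TensorFlow "
                "installation with XLA available."
            )
        addition_token_num = int(self.add_start_token + self.add_end_token)
        # Step 1: left-pad each row out to the longest sequence in the batch
        # (`x.bounding_shape()[-1]` would avoid densifying, per the other
        # review comment).
        token_ids, padding_mask = self.packer(
            x,
            sequence_length=x.to_tensor().shape[-1] + addition_token_num,
            add_start_value=self.add_start_token,
            add_end_value=self.add_end_token,
            padding_side="left",
        )
        # Step 2: right-pad to the final target length, then overlay the
        # left-padded mask onto the full-length mask.
        token_ids, all_padding_mask = self.packer(
            token_ids,
            sequence_length=sequence_length,
            add_start_value=False,
            add_end_value=False,
            padding_side="right",
        )
        padding_mask = dynamic_update_slice(
            all_padding_mask, padding_mask, [0] * len(padding_mask.shape)
        )
        return token_ids, padding_mask

`call` would then invoke this with `sequence_length + 1` (since it later shifts the packed ids to build labels), and `generate_preprocess` with its own target length.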

addition_token_num = int(self.add_start_token + self.add_end_token)
token_ids, padding_mask = self.packer(
    x,
    sequence_length=x.to_tensor().shape[-1] + addition_token_num,


medium

Using x.to_tensor().shape[-1] can be inefficient because it creates a dense tensor from a ragged one just to get its shape. A more performant way to get the length of the longest sequence in a tf.RaggedTensor is to use x.bounding_shape()[-1]. This avoids the overhead of creating the intermediate dense tensor. This optimization also applies to line 177 in generate_preprocess.

Suggested change
sequence_length=x.to_tensor().shape[-1] + addition_token_num,
sequence_length=x.bounding_shape()[-1] + addition_token_num,
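For context, a quick illustration of the difference between the two APIs (standard TensorFlow, not code from this PR):

    import tensorflow as tf

    x = tf.ragged.constant([[1, 2, 3], [4, 5]])
    # Densifies the ragged tensor just to read its shape.
    print(x.to_tensor().shape[-1])  # 3
    # Reads the bounding shape directly, with no dense intermediate.
    print(x.bounding_shape()[-1])   # tf.Tensor(3, shape=(), dtype=int64)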

@mattdangerw
Member

mattdangerw commented Jul 25, 2025

I think we've chatted before on this, but the hard part is the modeling and generation side here, not the preprocessing. Can we start with a prototype of everything working for one model, figure out our usage, then start landing?

My main worry is a good plan modeling side. How are we going to land positional embedding changes to accommodate this? What does the actual high level UX look like with generate? Not fully fleshed out, but if I was going to try to land this with a good UX...

  • Consider looking at optional inputs to functional models (optional=True on a keras.Input), so we could extend our backbones to take position_ids, attention_mask and other inputs that would be helpful in some cases but not always necessary (see the sketch after this comment).
  • Make sure our position embedding layers can take in a position ids input of some sort, and that our sampling APIs are general enough to handle left padding.
  • Update our preprocessing layers to handle left padding.
  • Rewrite our generate compiled functions to handle left padding (configurably or as a default? not sure). We'd need some ability to provide position ids to our backbones at this point.
  • Figure out a nice UX for all of this. Enable it by default for generation? A settable property on the preprocessor?

The hard stuff technically will be the positional inputs, sampling changes, and compiled generation changes, while keeping things relatively compatible. The preprocessing is fairly easy. I think a prototype with a model (fine to open a draft PR), say for Gemma 2 or Llama 3, would help us validate these harder areas first, before we start landing this. I'd avoid Gemma 3 for starters, it's extra hard with the need for bidirectional attention in image inputs.
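A minimal sketch of the optional-input idea from the first bullet above, assuming Keras 3's optional=True on keras.Input. The layer and its fallback behavior are hypothetical, for illustration only:

    import keras
    from keras import ops

    class PositionAwareEmbedding(keras.layers.Layer):
        def __init__(self, vocab_size, hidden_dim, max_length, **kwargs):
            super().__init__(**kwargs)
            self.token_embedding = keras.layers.Embedding(vocab_size, hidden_dim)
            self.position_embedding = keras.layers.Embedding(max_length, hidden_dim)

        def call(self, token_ids, position_ids=None):
            if position_ids is None:
                # Fall back to 0..T-1 when no explicit positions are passed.
                position_ids = ops.expand_dims(
                    ops.arange(ops.shape(token_ids)[1]), 0
                )
            return self.token_embedding(token_ids) + self.position_embedding(
                position_ids
            )

    token_ids = keras.Input(shape=(None,), dtype="int32", name="token_ids")
    position_ids = keras.Input(
        shape=(None,), dtype="int32", name="position_ids", optional=True
    )
    outputs = PositionAwareEmbedding(32000, 64, 512)(token_ids, position_ids)
    model = keras.Model([token_ids, position_ids], outputs)

At call time, passing None for position_ids would fall back to default positions, while explicit position_ids could account for left padding.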

@pass-lin
Contributor Author

pass-lin commented Jul 26, 2025

[quotes @mattdangerw's comment above in full]

You're right, but after thinking about it, I think we should start with Qwen or Llama, mainly for the following reasons.
First, both models are pre-trained with left padding, so we don't need to consider various complicated input arrangements.
Second, even treated as right-padding models, these two are the easiest to handle. Only two points need special attention. The first is the padding mask, which this PR implements. The second is the RoPE position, but the nature of RoPE guarantees that the value is the same for the same relative position.
We can explain why RoPE has this property as follows; this is the original derivation written by the author of RoPE.

[image: excerpt of the rotary position encoding formulation from the original RoPE paper]
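For reference, the property being invoked is the standard RoPE identity (as in Su et al.'s RoFormer paper; the notation below is a common textbook form, not text from this thread). Rotations compose as R_m^T R_n = R_{n-m}, so attention scores depend only on relative offsets:

    % Queries and keys are rotated by an angle proportional to their
    % absolute positions m and n; their inner product depends only on n - m.
    q_m = R_{\Theta,m} W_q x_m, \qquad k_n = R_{\Theta,n} W_k x_n
    \quad\Longrightarrow\quad
    q_m^{\top} k_n = x_m^{\top} W_q^{\top} R_{\Theta,\,n-m} W_k x_n

Shifting every position by a constant, as left padding does, therefore leaves all query–key scores unchanged.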

Therefore, we can set aside the more complex models and implement these two simpler cases first.

If you think this PR is too simple, I will implement left padding support for Qwen and Llama.

@mattdangerw
Member

@pass-lin fine to start with an end-to-end prototype for Llama or Qwen, and you are right that RoPE in both these models is position invariant. It's not that this PR is too simple; it's that I think we need to peer ahead at where we are going a bit. Qwen or Llama would allow us to figure out the user-facing experience and the changes to compiled generation that will be necessary. We do ultimately want a plan for models with absolute position embeddings (e.g. GPT-2), so that may be worth thinking about a bit, but it's fine to start with whatever model.

@pass-lin pass-lin closed this Aug 7, 2025