fix Glm4v batch videos forward #39172
Conversation
Failed for changing the following:

```python
total_frames = video_grid_thw[0][0].item()
h = video_grid_thw[0][1].item()
w = video_grid_thw[0][2].item()
video_grid_thw = [[1, h, w] for _ in range(total_frames)]
```
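For illustration, a toy standalone run of this rewrite (grid values assumed) shows why reading only `video_grid_thw[0]` breaks batched inputs: the second video's grid never reaches the rebuilt list.

```python
import torch

# Assumed toy batch: two videos with different grids
video_grid_thw = torch.tensor([[4, 12, 12], [2, 8, 8]])

total_frames = video_grid_thw[0][0].item()  # 4 -- only video 0 is read
h = video_grid_thw[0][1].item()             # 12
w = video_grid_thw[0][2].item()             # 12
rebuilt = [[1, h, w] for _ in range(total_frames)]
print(rebuilt)  # [[1, 12, 12]] x 4 -- video 1's [2, 8, 8] is dropped
```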
I think we also would need to pad `timestamps`, as otherwise it will fail when different numbers of frames are sampled per video. We've been discussing it internally with @zRzRzRzRzRzRzR; not sure though if he has any PR yet.
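A minimal sketch of the kind of padding being suggested, assuming `timestamps` arrives as one 1-D tensor per video (the helper name and pad value are hypothetical, not from this PR):

```python
import torch
import torch.nn.functional as F

def pad_timestamps(timestamps_per_video, pad_value=0.0):
    # Hypothetical helper: right-pad each video's timestamps to the longest
    # video so they stack into a single [num_videos, max_len] tensor.
    max_len = max(t.shape[0] for t in timestamps_per_video)
    padded = [F.pad(t, (0, max_len - t.shape[0]), value=pad_value) for t in timestamps_per_video]
    return torch.stack(padded, dim=0)
```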
Yes, `timestamps` is not good to return here; can we return it like qwen2_5vl does?

transformers/src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py, lines 158 to 166 in df12d87:

```python
if isinstance(fps, (int, float)):
    second_per_grid_ts = [self.video_processor.temporal_patch_size / fps] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
    second_per_grid_ts = [self.video_processor.temporal_patch_size / tmp for tmp in fps]
else:
    raise ValueError(
        f"The length of fps ({len(fps) if hasattr(fps, '__len__') else fps}) must be equal to the length of video_grid_thw ({len(video_grid_thw)}) or fps should be a single number."
    )
videos_inputs.update({"second_per_grid_ts": second_per_grid_ts})
```
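As a usage sketch of the first branch above (all values assumed for illustration), returning one scalar per video sidesteps ragged per-frame lists entirely:

```python
# Assumed values, for illustration only
fps = 2.0                                    # single number -> first branch
temporal_patch_size = 2                      # stands in for self.video_processor.temporal_patch_size
video_grid_thw = [[8, 12, 12], [4, 12, 12]]  # two videos

second_per_grid_ts = [temporal_patch_size / fps] * len(video_grid_thw)
print(second_per_grid_ts)  # [1.0, 1.0] -- one scalar per video, never ragged
```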
```python
num_image_tokens = (
    video_grid_thw[video_index].prod() // merge_length // video_grid_thw[video_index][0]
)
for frame_idx in range(num_frames):
    if self.image_token in text[i]:
        text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
```
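A quick sanity check on that token count with assumed values (`merge_length = 4` from a merge size of 2, one video with grid `[t, h, w] = [4, 12, 12]`):

```python
import torch

merge_length = 2 ** 2                 # assumed spatial merge size of 2
grid = torch.tensor([4, 12, 12])      # assumed [t, h, w] for one video

num_image_tokens = int(grid.prod() // merge_length // grid[0])
print(num_image_tokens)  # 576 // 4 // 4 = 36 placeholder tokens per frame
```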
Perfect, this has been itching me since release ❤️ I agree this works when an equal number of frames is sampled per video.
```python
# reshape video_grid_thw -> [b, 3] -> [1, h, w] * frames
temp_frames_hw = []
for t, h, w in video_grid_thw:
    repeated_row = torch.tensor([1, h.item(), w.item()]).unsqueeze(0).repeat(t, 1)
    temp_frames_hw.append(repeated_row)
flattened_video_grid_thw = torch.cat(temp_frames_hw, dim=0)
video_embeds = self.visual(pixel_values_videos, grid_thw=flattened_video_grid_thw)
```
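Running that reshape standalone with toy grids (values assumed) confirms the output is one `[1, h, w]` row per frame:

```python
import torch

video_grid_thw = torch.tensor([[3, 8, 8], [2, 4, 4]])  # assumed: two videos

temp_frames_hw = []
for t, h, w in video_grid_thw:
    temp_frames_hw.append(torch.tensor([1, h.item(), w.item()]).unsqueeze(0).repeat(t, 1))
flattened = torch.cat(temp_frames_hw, dim=0)
print(flattened.shape)  # torch.Size([5, 3]) -- 3 + 2 frames, each [1, h, w]
```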
Oh, probably because this is just copied from Qwen2-VL when running modular. To actually fix it, we need to overwrite `get_video_features` in `modular_glm4v.py` instead of inheriting from Qwen2-VL.
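A rough sketch of the override being discussed, with a hypothetical stand-in class (the real method in `modular_glm4v.py` has a fuller signature):

```python
import torch

class Glm4vModelSketch:  # hypothetical stand-in for the modular class
    def get_video_features(self, pixel_values_videos, video_grid_thw):
        # Expand each video's [t, h, w] row into t rows of [1, h, w] so the
        # vision tower receives one grid entry per frame, as in the diff above.
        rows = []
        for t, h, w in video_grid_thw:
            rows.append(torch.tensor([1, h.item(), w.item()]).unsqueeze(0).repeat(t, 1))
        flattened_video_grid_thw = torch.cat(rows, dim=0)
        return self.visual(pixel_values_videos, grid_thw=flattened_video_grid_thw)
```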
Thanks! I have overwritten this function in `modular_glm4v.py`.
[For maintainers] Suggested jobs to run (before merge): run-slow: glm4v
😀 Do I need to write more unit tests for this change?
What does this PR do?
Fixes issues in the video processing and `get_video_features` for GLM4V.
I have tested it with the following scripts.
For forward logits checking, @zRzRzRzRzRzRzR
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp cc @zRzRzRzRzRzRzR