
fix Glm4v batch videos forward #39172


Open
wants to merge 7 commits into base: main

Conversation

Contributor

@Kuangdd01 Kuangdd01 commented Jul 2, 2025

What does this PR do?

Fixes issues in video_processing and get_video_features for GLM4V.

Tested with the following script:

import cv2
import torch
from PIL import Image
from transformers import AutoProcessor, Glm4vForConditionalGeneration

def prepare_video_metadata(videos):
    video_metadata = []
    for video in videos:
        if isinstance(video, list):
            num_frames = len(video)
        elif hasattr(video, "shape"):
            if len(video.shape) == 4:  # (T, H, W, C)
                num_frames = video.shape[0]
            else:
                num_frames = 1
        else:
            # Fall back to a fixed frame count when it cannot be inferred.
            num_frames = 8

        metadata = {
            "fps": 2,
            "duration": num_frames / 2,
            "total_frames": num_frames,
        }
        video_metadata.append(metadata)
    return video_metadata

def test_video_processing(video_path_list, num_frames=4):
    selected_frames = []
    for video_path in video_path_list:
        cap = cv2.VideoCapture(video_path)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        print(f"Total frames: {frame_count}")
        cap.release()

    # Sample `num_frames` evenly spaced frames from each video as PIL images.
    for video_path in video_path_list:
        temp_frames = []
        cap = cv2.VideoCapture(video_path)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        step = max(frame_count // num_frames, 1)
        for i in range(0, frame_count, step):
            cap.set(cv2.CAP_PROP_POS_FRAMES, i)
            ret, frame = cap.read()
            if not ret:
                continue
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            pil_img = Image.fromarray(frame_rgb)
            temp_frames.append(pil_img)
        cap.release()
        selected_frames.append(temp_frames)

    video_metadata = prepare_video_metadata(selected_frames)
    video_inputs = processor.video_processor(videos=selected_frames, video_metadata=video_metadata)

    questions = ["What kind of dog is this?", "Describe the background."]

    messages_batch = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "video"},
                    {"type": "text", "text": question},
                ],
            }
        ]
        for question in questions
    ]

    texts = [
        processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
        for msg in messages_batch
    ]

    inputs_batch = processor(text=texts, videos=selected_frames, video_metadata=video_metadata, return_tensors="pt", padding=True)

    print(processor.batch_decode(inputs_batch['input_ids'])[0])
    rope_pos, deltas = model.model.get_rope_index(
        inputs_batch["input_ids"],
        None,
        inputs_batch["video_grid_thw"],
        inputs_batch["attention_mask"]
    )

    print(rope_pos.shape, "\n", deltas)

processor_name = "THUDM/GLM-4.1V-9B-Thinking"
processor = AutoProcessor.from_pretrained(processor_name)
model = Glm4vForConditionalGeneration.from_pretrained(processor_name)

if __name__ == "__main__":
    # image_path = "./data/mllm_demo_data/1.jpg"
    video_path_1 = "./data/mllm_demo_data/1.mp4"
    video_path_2 = "./data/mllm_demo_data/2.avi"

    test_video_processing([video_path_1, video_path_2])

For forward logits checking, cc @zRzRzRzRzRzRzR

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp cc @zRzRzRzRzRzRzR

@Kuangdd01
Contributor Author

CI failed because the change to get_video_features is not consistent with the code generated from the modular file. 😂

total_frames = video_grid_thw[0][0].item()
h = video_grid_thw[0][1].item()
w = video_grid_thw[0][2].item()
video_grid_thw = [[1, h, w] for _ in range(total_frames)]

Member

I think we would also need to pad timestamps, as otherwise it will fail when a different number of frames is sampled per video. We've been discussing it internally with @zRzRzRzRzRzRzR, not sure though if he has a PR yet.
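
For illustration, the padding could look something like the sketch below; pad_timestamps and the -1.0 fill value are made up here for the example, not part of this PR.

import torch

def pad_timestamps(timestamps_per_video, pad_value=-1.0):
    # Pad per-video timestamp lists to the length of the longest video so they
    # can be stacked into a single (num_videos, max_frames) tensor.
    max_len = max(len(ts) for ts in timestamps_per_video)
    padded = torch.full((len(timestamps_per_video), max_len), pad_value)
    for i, ts in enumerate(timestamps_per_video):
        padded[i, : len(ts)] = torch.tensor(ts, dtype=padded.dtype)
    return padded

# e.g. one video with 4 sampled frames and one with 6 -> shape (2, 6)
# pad_timestamps([[0.0, 0.5, 1.0, 1.5], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]])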

Contributor Author

Yes, returning timestamps here is not great; can we return it the way qwen2_5_vl does?

if isinstance(fps, (int, float)):
    second_per_grid_ts = [self.video_processor.temporal_patch_size / fps] * len(video_grid_thw)
elif hasattr(fps, "__len__") and len(fps) == len(video_grid_thw):
    second_per_grid_ts = [self.video_processor.temporal_patch_size / tmp for tmp in fps]
else:
    raise ValueError(
        f"The length of fps ({len(fps) if hasattr(fps, '__len__') else fps}) must be equal to the length of video_grid_thw ({len(video_grid_thw)}) or fps should be a single number."
    )
videos_inputs.update({"second_per_grid_ts": second_per_grid_ts})

Comment on lines +192 to +198
num_image_tokens = (
    video_grid_thw[video_index].prod() // merge_length // video_grid_thw[video_index][0]
)
for frame_idx in range(num_frames):
    if self.image_token in text[i]:
        text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)

Member

Perfect, this has been itching me since release ❤️ I agree this works when an equal number of frames is sampled per video.

Comment on lines +1182 to +1188
# reshape video_grid_thw -> [b, 3] -> [1, h, w] * frames
temp_frames_hw = []
for t, h, w in video_grid_thw:
    repeated_row = torch.tensor([1, h.item(), w.item()]).unsqueeze(0).repeat(t, 1)
    temp_frames_hw.append(repeated_row)
flattened_video_grid_thw = torch.cat(temp_frames_hw, dim=0)
video_embeds = self.visual(pixel_values_videos, grid_thw=flattened_video_grid_thw)

Member

Oh, probably because this is just copied from Qwen2-VL when running modular. To actually fix it, we need to override get_video_features in modular_glm4v.py instead of inheriting it from Qwen2-VL.
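
For illustration, the override in modular_glm4v.py could look roughly like the sketch below, wrapping the per-frame expansion from the diff above; the exact signature and return handling are assumptions based on the Qwen2-VL version, not the final code.

# Sketch only: assumes torch is already imported at module level and that the
# signature mirrors the Qwen2-VL get_video_features it replaces.
def get_video_features(self, pixel_values_videos, video_grid_thw=None):
    pixel_values_videos = pixel_values_videos.type(self.visual.dtype)
    # GLM-4V's vision tower expects one [1, h, w] row per frame, so expand each
    # per-video [t, h, w] entry into t rows before calling it.
    temp_frames_hw = []
    for t, h, w in video_grid_thw:
        repeated_row = torch.tensor([1, h.item(), w.item()]).unsqueeze(0).repeat(t, 1)
        temp_frames_hw.append(repeated_row)
    flattened_video_grid_thw = torch.cat(temp_frames_hw, dim=0)
    video_embeds = self.visual(pixel_values_videos, grid_thw=flattened_video_grid_thw)
    return video_embeds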

Contributor Author

Thanks! I have overridden this function in modular_glm4v.py.

Contributor

github-actions bot commented Jul 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: glm4v

@Kuangdd01
Contributor Author

😀 Do I need to write more unit tests for this change?
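
If it helps, a batched-video test could look roughly like the sketch below; the dummy inputs, questions, and assertions are assumptions for illustration, not an existing test in the repo.

from PIL import Image
from transformers import AutoProcessor

def test_batch_videos_with_different_frame_counts():
    processor = AutoProcessor.from_pretrained("THUDM/GLM-4.1V-9B-Thinking")
    # Two dummy videos with different numbers of frames.
    videos = [
        [Image.new("RGB", (64, 64)) for _ in range(4)],
        [Image.new("RGB", (64, 64)) for _ in range(8)],
    ]
    video_metadata = [
        {"fps": 2, "duration": 2.0, "total_frames": 4},
        {"fps": 2, "duration": 4.0, "total_frames": 8},
    ]
    texts = [
        processor.apply_chat_template(
            [{"role": "user", "content": [{"type": "video"}, {"type": "text", "text": q}]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for q in ("What is in the video?", "Describe the scene.")
    ]
    inputs = processor(
        text=texts, videos=videos, video_metadata=video_metadata, return_tensors="pt", padding=True
    )
    # The batch should stay rectangular and keep a video grid for each input.
    assert inputs["input_ids"].shape[0] == 2
    assert "video_grid_thw" in inputs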
