The same situation as #31377 occurred when using Qwen/Qwen2-VL-7B-Instruct #33399

Description

@toondata

System Info

  • transformers version: 4.45.0.dev0
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.4
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@zucchini-nlp @amyer

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the following code after cloning transformers at the commit hash I specified above and installing it with pip install ./transformers:

import base64
import io

import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

device = "mps"  # the original snippet used self.device inside a class; "mps" per the comment
model_path = ".models/Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    # attn_implementation="default"
).to(device)

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract text from pdf"},
        ],
    }
]

# image_data is a data-URL string ("data:image/jpeg;base64,...") supplied by the caller
base64_data = image_data.split(",")[1]  # strip the 'data:image/jpeg;base64,' prefix
image_bytes = base64.b64decode(base64_data)
image = Image.open(io.BytesIO(image_bytes))

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(device)

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
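
A useful check before calling generate is whether the processor actually expanded the image placeholder into vision tokens, since the traceback below indicates the model found no image-token positions in input_ids. A minimal diagnostic sketch, assuming the model config exposes image_token_id (the <|image_pad|> token in Qwen2-VL) and that the processor output includes image_grid_thw:

# Diagnostic sketch (assumptions: model.config.image_token_id exists and
# marks image positions; the processor output includes image_grid_thw).
image_token_id = model.config.image_token_id
n_image_tokens = (inputs.input_ids == image_token_id).sum().item()
print("image placeholder tokens in prompt:", n_image_tokens)
print("image_grid_thw:", inputs.get("image_grid_thw"))
# Generation can only succeed if n_image_tokens matches the number of rows
# the vision tower produces for this image (630 in the traceback below);
# a count of 0 yields exactly the shape-mismatch error reported here.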

Expected behavior

Generation should complete and return the decoded text. Instead, model.generate fails with the traceback below:

File "/Users/dev/products/dev/workspaces/mixparse/llm/model/modelmanager.py", line 429, in _run_safetensors_inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/transformers/generation/utils.py", line 2015, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/transformers/generation/utils.py", line 2965, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dev/anaconda3/envs/all-parse/lib/python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1683, in forward
inputs_embeds[image_mask] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [630, 3584] cannot be broadcast to indexing result of shape [0, 3584]
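
For reference, the failing line in modeling_qwen2_vl.py scatters the vision embeddings into the text embeddings through a boolean mask. The error reproduces in isolation whenever that mask selects zero positions while the value tensor is non-empty; a standalone sketch of the mechanism (illustrative shapes only, not Qwen2-VL code):

import torch

hidden = 3584
inputs_embeds = torch.zeros(1, 20, hidden)          # text embeddings
image_embeds = torch.zeros(630, hidden)             # vision-tower output
image_mask = torch.zeros(1, 20, dtype=torch.bool)   # no image tokens matched

# With an all-False mask the indexing result has shape [0, 3584], so the
# 630-row value tensor cannot be broadcast into it and PyTorch raises the
# RuntimeError shown above.
inputs_embeds[image_mask] = image_embeds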
