Closed
System Info
- transformers version: 4.55.2
- Platform: Linux-5.10.0-35-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.13.5
- Huggingface_hub version: 0.34.4
- Safetensors version: 0.6.2
- Accelerate version: 1.10.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: doesn't matter, but yes
- GPU type: NVIDIA A100-SXM4-80GB
Who can help?
Currently the PerceptionLM image processor does not let us override vision_input_type from thumb+tile to vanilla when applying the processor: the value is read only at initialization, and there is no per-call override. I'd propose making a change to this line so that the default value is still inherited, but can be overridden by kwargs when manually creating the processor.
@shuminghu is likely the best POC for this.
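A minimal sketch of the kwargs-override pattern being proposed (class and method names here are illustrative stand-ins, not the actual transformers source):

```python
class PerceptionLMImageProcessorSketch:
    """Hypothetical stand-in for the PerceptionLM image processor."""

    def __init__(self, vision_input_type="thumb+tile", **kwargs):
        # Current behavior: the value is fixed at initialization.
        self.vision_input_type = vision_input_type

    def preprocess(self, images, **kwargs):
        # Proposed behavior: a per-call kwarg overrides the init-time
        # default instead of being silently ignored.
        vision_input_type = kwargs.pop("vision_input_type", self.vision_input_type)
        return vision_input_type
```

With this pattern, `preprocess(images)` keeps the init-time default, while `preprocess(images, vision_input_type="vanilla")` takes effect as the issue expects.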
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-1B"

processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

test_image_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_images",
    filename="14496_0.PNG",
    repo_type="dataset",
)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": test_image_file,
            },
            {"type": "text", "text": "Describe the bar plot in the image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    images_kwargs={
        "vision_input_type": "vanilla",
        "tile_size": 448,
        "max_num_tiles": 1,
    },
)
```
Expected behavior
The vision_input_type passed via images_kwargs should override the processor's init-time value, as it does in the original (non-Hugging Face) PerceptionLM code.