
PerceptionLM Image Processor doesn't properly allow for vision_input_type override #40251

@tyleryzhu

Description


System Info

  • transformers version: 4.55.2
  • Platform: Linux-5.10.0-35-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.13.5
  • Huggingface_hub version: 0.34.4
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: doesn't matter, but yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

Currently the PerceptionLM image processor does not allow vision_input_type to be overridden from thumb+tile to vanilla when the processor is applied: the value is only read at initialization, with no per-call override. I'd propose changing this line so that it still inherits the default value from initialization but can also be overridden by kwargs passed when the processor is called.
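
A minimal sketch of the proposed change (the exact method and variable names inside the processor are assumptions here, not the actual transformers source):

# current behavior: the init-time value is always used during preprocessing
vision_input_type = self.vision_input_type

# proposed: keep the init-time value as the default, but let per-call
# kwargs (e.g. images_kwargs from apply_chat_template) take precedence
vision_input_type = kwargs.get("vision_input_type", self.vision_input_type)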

@shuminghu is likely the best POC for this.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-1B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

test_image_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_images",
    filename="14496_0.PNG",
    repo_type="dataset",
)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": test_image_file,
            },
            {"type": "text", "text": "Describe the bar plot in the image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
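    # These per-call overrides are expected to switch the processor to
    # vanilla (single-tile) preprocessing, but are currently ignored
    # because vision_input_type is only read at initialization.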
    images_kwargs={
        "vision_input_type": "vanilla",
        "tile_size": 448,
        "max_num_tiles": 1,
    }
)
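
One way to observe that the override is ignored is to inspect the returned pixel values (this check is a suggestion, not part of the original report; it assumes the processor returns a pixel_values tensor as other image-text-to-text processors do):

# With a working override, "vanilla" + max_num_tiles=1 should yield a single
# 448x448 tile per image; the default "thumb+tile" yields multiple tiles.
print(inputs["pixel_values"].shape)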

Expected behavior

The vision_input_type override passed via images_kwargs should take effect, matching the behavior of the original (non-Hugging Face) PerceptionLM code.
