
PerceptionLM Image Processor doesn't properly allow for vision_input_type override #40251

@tyleryzhu

Description


System Info

  • transformers version: 4.55.2
  • Platform: Linux-5.10.0-35-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.13.5
  • Huggingface_hub version: 0.34.4
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: doesn't matter, but yes
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

Currently the PerceptionLM image processor does not allow vision_input_type to be overridden from thumb+tile to vanilla when the processor is applied: the value is only read at initialization, with no per-call override. I'd propose changing this line so that it still inherits the default value from initialization but can also be overridden by kwargs passed when the processor is called.
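
A minimal sketch of the proposed change (the exact method and variable names inside the processor are assumptions here, not the actual transformers source):

# current behavior: the init-time value is always used during preprocessing
vision_input_type = self.vision_input_type

# proposed: keep the init-time value as the default, but let per-call
# kwargs (e.g. images_kwargs from apply_chat_template) take precedence
vision_input_type = kwargs.get("vision_input_type", self.vision_input_type)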

@shuminghu is likely the best POC for this.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForImageTextToText
from huggingface_hub import hf_hub_download

MODEL_PATH = "facebook/Perception-LM-1B"
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH).to("cuda")

test_image_file = hf_hub_download(
    repo_id="shumingh/perception_lm_test_images",
    filename="14496_0.PNG",
    repo_type="dataset",
)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": test_image_file,
            },
            {"type": "text", "text": "Describe the bar plot in the image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    [conversation],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
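    # These per-call overrides are expected to switch the processor to
    # vanilla (single-tile) preprocessing, but are currently ignored
    # because vision_input_type is only read at initialization.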
    images_kwargs={
        "vision_input_type": "vanilla",
        "tile_size": 448,
        "max_num_tiles": 1,
    }
)
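
One way to observe that the override is ignored is to inspect the returned pixel values (this check is a suggestion, not part of the original report; it assumes the processor returns a pixel_values tensor as other image-text-to-text processors do):

# With a working override, "vanilla" + max_num_tiles=1 should yield a single
# 448x448 tile per image; the default "thumb+tile" yields multiple tiles.
print(inputs["pixel_values"].shape)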

Expected behavior

The vision_input_type override passed via images_kwargs should take effect, matching the behavior of the original (non-Hugging Face) PerceptionLM code.
