
Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"



by Changdae Oh¹, Zhen Fang², Shawn Im¹, Xuefeng Du¹, and Yixuan Li¹.
¹University of Wisconsin–Madison, ²University of Technology Sydney

Paper · Poster · Slide · Datasets

Overview

In this repository, we highlight our proposals, along with code and instructions to reproduce our experiments and to support broader usage.

Research Highlight

  • We presented effective mutual information (EMI), $\text{EMI}(P_{XY};P_{\theta}):=I(P_{X}\otimes P_{\theta})-I(P_{XY})$, as a new theory-grounded metric to assess the quality of an MLLM's output given an input query.
    • Our theoretical analysis reveals the connection between EMI and LLM-judge-based pairwise preference scores, such as the relative preference score or win rate.
  • Based on EMI, we proposed effective mutual information difference (EMID), $\text{EMID}(P_{XY},Q_{XY};P_{\theta}):=\text{EMI}(P_{XY};P_{\theta})-\text{EMI}(Q_{XY};P_{\theta})$, as an information-theoretic measure of MLLM robustness under distribution shifts (see the sketch after this list).
    • We then provided a theoretical upper bound on EMID, constructed from the $D_{\rm JS}(P_{X_v}||Q_{X_v})$, $D_{\rm JS}(P_{X_t}||Q_{X_t})$, $D_{\rm JS}(P_{Y_{\theta}}||P_{Y})$, and $D_{\rm JS}(Q_{Y_{\theta}}||Q_{Y})$ terms, to characterize the performance gap of an MLLM under distribution shifts.
  • Across 61 types of distribution shifts, we validated that empirical EMI estimates correlate strongly with relative preference scores, and that EMID upper-bound estimates consistently correlate with EMID estimates.
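
The sketch below restates these two definitions in code. It is a minimal illustration, not the repository's API: mi_estimate stands in for any trained mutual-information estimator over embeddings (e.g., a CLUB model), and the function names are hypothetical.

import numpy as np

def emi(x_emb: np.ndarray, y_model_emb: np.ndarray, y_gt_emb: np.ndarray, mi_estimate) -> float:
    # EMI(P_XY; P_theta) = I(X; Y_theta) - I(X; Y), both estimated on embeddings.
    return mi_estimate(x_emb, y_model_emb) - mi_estimate(x_emb, y_gt_emb)

def emid(id_tuple, ood_tuple, mi_estimate) -> float:
    # EMID(P_XY, Q_XY; P_theta) = EMI on ID data minus EMI on OOD data.
    return emi(*id_tuple, mi_estimate) - emi(*ood_tuple, mi_estimate)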

Procedure

Our project is built on top of the LLaVA codebase, and here we only provide the pipeline for the EMI, EMID, and upper-bound computations; please refer to the LLaVA paper and repository for details on MLLM training and inference.

  • Basic information

    • EMI consumes an (image_query:PILImage, text_query:str, model_response:str, GT_response:str) tuple as input to assess the quality of a model response.
    • EMID and the EMID upper bound (UB) consume a pair of such tuples from two different data distributions to measure how robust model-response quality is across input distributions.
    • We compute all of the above quantities on top of embeddings from pre-trained encoders such as CLIP-ViT and RoBERTa, bypassing non-trivial MI modeling on the raw input space.
  • EMID and its upper-bound estimation on a pair of datasets, e.g., one in-distribution (ID) and one out-of-distribution (OOD)

    1. Run inference on all datasets of interest to gather your model's responses $Y_{\theta}$ given the input queries.
    2. Get embedding vectors $\tilde{X}_{v}$, $\tilde{X}_{t}$, $\tilde{Y}_{\theta}$, and $\tilde{Y}_{gt}$ for the (image_query, text_query, model_response, GT_response) tuples with pre-trained vision and text encoders (a sketch follows this list). If you don't have ground-truth (GT) responses for a dataset, obtain them by querying a reference model, e.g., GPT-4o.
    3. (Optional) Construct an embedding-pair dataset $\{(\tilde{X},\tilde{Y})\}$, and train a neural MI estimator on it.
    4. Compute EMI and EMID by feeding the embedding tuples into the (pre-)trained MI estimator.
    5. Compute the EMID UB on top of the embedding tuples with the RJSD estimator (see the JSD_cov() function in main.py).
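
As an illustration of step 2, the sketch below embeds one tuple with off-the-shelf Hugging Face encoders. The specific checkpoints (openai/clip-vit-large-patch14, roberta-large) and the mean pooling of RoBERTa outputs are assumptions made for this sketch; the actual preprocessing in main.py may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, RobertaModel, RobertaTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
roberta = RobertaModel.from_pretrained("roberta-large")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-large")

@torch.no_grad()
def embed_tuple(image_query: Image.Image, text_query: str, model_response: str, gt_response: str):
    # Visual embedding X_v from the CLIP image tower.
    pixels = clip_proc(images=image_query, return_tensors="pt")
    x_v = clip.get_image_features(**pixels)                   # shape (1, d_clip)

    # Text embeddings X_t, Y_theta, Y_gt from RoBERTa (mean-pooled last hidden states).
    def text_emb(s: str) -> torch.Tensor:
        toks = roberta_tok(s, return_tensors="pt", truncation=True, max_length=512)
        return roberta(**toks).last_hidden_state.mean(dim=1)  # shape (1, d_roberta)

    return x_v, text_emb(text_query), text_emb(model_response), text_emb(gt_response)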

Environment

Exactly the same as the llava-v1.5 environment, plus installing datasets==3.5.0.

conda create -n mllmshift-emi python=3.10 -y
conda activate mllmshift-emi

git clone https://github.com/deeplearning-wisc/mllmshift-emi.git
cd mllmshift-emi

pip install --upgrade pip
pip install -e .
pip install datasets==3.5.0

Data Preparation

  • To test new models on the llava-bench shift benchmarks, you need to prepare model responses for all distribution-shift scenarios (28 natural, 35 synthetic).
  • You can publicly access our two benchmarks on the Hugging Face dataset hub, llavabench-shift-synthetic-v1 and llavabench-shift-natural-v1; each contains image queries, text queries, and GT responses (from GPT-4). A loading sketch follows this list.
    • To generate the synthetically perturbed datasets, we adopted defocus blur and frost as visual perturbations, and keyboard typo and synonym replacement as textual perturbations, leveraging the MM_Robustness codebase.
  • Refer to the evaluation documentation in the LLaVA repository to obtain model responses by running inference with your MLLMs.
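
A minimal loading sketch with the datasets library is shown below. The hub paths are assumptions (the organization prefix may differ); substitute the actual dataset IDs from the Hugging Face hub.

from datasets import load_dataset

# Hypothetical hub paths; replace with the actual dataset IDs.
natural = load_dataset("deeplearning-wisc/llavabench-shift-natural-v1")
synthetic = load_dataset("deeplearning-wisc/llavabench-shift-synthetic-v1")

print(natural)  # inspect available splits and columns (image query, text query, GT response)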

Run

  • We provide pre-trained weights for the CLUB MI estimator at estimator_ckpt/CLUB_global.pt, so you don't need to retrain the MI estimator from scratch.
    • Unlike the setup in our paper, which used two separate MI estimators for synthetic and natural shifts, this estimator was trained on a pooled dataset of the synthetic and natural shift data with >10K samples.
      • The replication results may therefore differ slightly from the numbers in the paper.
    • [CAUTION!] If your downstream tasks are significantly different from the llava-bench family of datasets, you may need to retrain it.
  • For easy reproduction, we also provide the responses generated by the llava-v1.5-13b and llava-v1.6-vicuna-13b models under data/{DATA_SPLIT_NAME}-{MODEL_NAME}.jsonl.
unzip data.zip
python main.py --model_name llava-v1.5-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.5-13b --shift_type NATURAL
python main.py --model_name llava-v1.6-vicuna-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.6-vicuna-13b --shift_type NATURAL
  • After running the commands above, you will find the organized results in results/*.json; a minimal loading sketch follows.
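
A small sketch for inspecting those result files; since the exact JSON structure is produced by main.py, the snippet only prints whatever keys or record counts it finds.

import glob, json

for path in sorted(glob.glob("results/*.json")):
    with open(path) as f:
        result = json.load(f)
    # Print top-level keys for dict-style results, or the record count for list-style results.
    print(path, list(result.keys()) if isinstance(result, dict) else f"{len(result)} records")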

Citation

If this repository was useful to your work, please consider citing our paper!

@inproceedings{oh2025understanding,
title={Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach},
author={Oh, Changdae and Fang, Zhen and Im, Shawn and Du, Xuefeng and Li, Yixuan},
booktitle={International Conference on Machine Learning},
year={2025},
}

Acknowledgement

  • We appreciate the LLaVA authors' amazing work and fully open codebase, which enabled us to initiate our project.
  • We are also sincerely thankful to the authors of CLUB and RepresentationJSD, whose work allowed us to build a reliable estimation framework for mutual information and Jensen–Shannon divergence.
  • We thank the authors of the MM_Robustness repository, which we used to construct our synthetic shift benchmarks.
