Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach (ICML'25)
by Changdae Oh¹, Zhen Fang², Shawn Im¹, Xuefeng Du¹, and Yixuan Li¹.
¹University of Wisconsin–Madison, ²University of Technology Sydney
This repository highlights our proposals, along with code and instructions to reproduce our experiments and to support broader usage.
- We presented effective mutual information (EMI), $\text{EMI}(P_{XY};P_{\theta}) := I(P_{X}\otimes P_{\theta}) - I(P_{XY})$, as a new theory-grounded metric to assess the quality of an MLLM's outputs given input queries (a conceptual sketch follows after this list).
- Our theoretical analysis reveals the connection between EMI and LLM-judge-based pairwise preference scores, such as the relative preference score or win rate.
- Based on EMI, we proposed effective mutual information difference (EMID), $\text{EMID}(P_{XY},Q_{XY};P_{\theta}) := \text{EMI}(P_{XY};P_{\theta}) - \text{EMI}(Q_{XY};P_{\theta})$, as an information-theoretic measure of MLLM robustness under distribution shifts.
- We then provided a theoretical upper bound on EMID, constructed from the $D_{\rm JS}(P_{X_v}||Q_{X_v})$, $D_{\rm JS}(P_{X_t}||Q_{X_t})$, $D_{\rm JS}(P_{Y_{\theta}}||P_{Y})$, and $D_{\rm JS}(Q_{Y_{\theta}}||Q_{Y})$ terms, to characterize the performance gap of an MLLM under distribution shifts.
- On 61 types of distribution shifts, we validated that empirical EMI estimates correlate strongly with relative preference scores, and that EMID upper bound estimates consistently correlate with EMID estimates.
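As a conceptual illustration of the two definitions above (not the repository's actual API), the sketch below treats a mutual-information estimator as a black box; the function names and batch format are hypothetical.

```python
# Conceptual sketch of EMI and EMID, assuming a black-box MI estimator
# mi(x, y) that returns an estimate of I(X; Y) from paired embeddings.
# Function names and the (x, y_model, y_gt) batch format are hypothetical.

def emi(mi, x, y_model, y_gt):
    # EMI(P_XY; P_theta) = I(P_X ⊗ P_theta) - I(P_XY):
    # MI between inputs and model responses minus MI between inputs and GT responses.
    return mi(x, y_model) - mi(x, y_gt)

def emid(mi, id_batch, ood_batch):
    # EMID(P, Q; P_theta) = EMI on distribution P minus EMI on distribution Q.
    x_p, ym_p, yg_p = id_batch
    x_q, ym_q, yg_q = ood_batch
    return emi(mi, x_p, ym_p, yg_p) - emi(mi, x_q, ym_q, yg_q)
```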
Our project was built on top of the LLaVA codebase, and here we only provide the pipeline for EMI, EMID, and upper bound computations; please refer to the LLaVA paper and repository for details on MLLM training and inference.
- Basic information
  - EMI consumes an `(image_query: PILImage, text_query: str, model_response: str, GT_response: str)` tuple as input to assess the quality of a model response.
  - EMID and its upper bound (UB) consume a pair of such tuples from two different data distributions to measure the robustness of model response quality across input distributions.
  - We compute all of the above quantities on top of embeddings from pre-trained encoder models such as CLIP-ViT and RoBERTa, to bypass non-trivial MI modeling on the raw input space (see the embedding sketch below).
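For the embedding step mentioned above, the following is a minimal sketch assuming Hugging Face transformers with a CLIP-ViT checkpoint (`openai/clip-vit-large-patch14`) and `roberta-large`; these checkpoints and the pooling choice are illustrative assumptions, not necessarily the exact encoders used in our experiments.

```python
# Minimal embedding sketch for the (image_query, text_query, model_response, GT_response)
# tuple. Checkpoints and pooling choices are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor, RobertaModel, RobertaTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
roberta = RobertaModel.from_pretrained("roberta-large").eval()
tok = RobertaTokenizer.from_pretrained("roberta-large")

@torch.no_grad()
def embed_tuple(image_query, text_query, model_response, gt_response):
    # Visual embedding X_v from CLIP-ViT.
    x_v = clip.get_image_features(**clip_proc(images=image_query, return_tensors="pt"))
    # Text embeddings X_t, Y_theta, Y_gt from RoBERTa (<s> token pooling).
    def text_emb(s):
        out = roberta(**tok(s, return_tensors="pt", truncation=True))
        return out.last_hidden_state[:, 0]
    return x_v, text_emb(text_query), text_emb(model_response), text_emb(gt_response)
```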
- EMID and its upper bound estimation on a pair of datasets, e.g., one in-distribution (ID) and one out-of-distribution (OOD)
  - Run inference on all datasets of interest to gather your model's responses $Y_{\theta}$ to the input queries.
  - Get embedding vectors $\tilde{X}_{v}$, $\tilde{X}_{t}$, $\tilde{Y}_{\theta}$, and $\tilde{Y}_{gt}$ for the `(image_query, text_query, model_response, GT_response)` tuples with pre-trained vision and text encoders. If you don't have ground-truth (GT) responses for a dataset, obtain them by querying a reference model, e.g., GPT-4o.
  - (Optional) Construct an embedding-pair dataset $\{(\tilde{X},\tilde{Y})\}$ and train a neural MI estimator on it.
  - You can compute EMI and EMID by feeding embedding tuples into the (pre-)trained MI estimator.
  - You can also compute the EMID UB on top of the embedding tuples with the RJSD estimator (see the `JSD_cov()` function in `main.py`; a simplified sketch follows after this list).
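The following is a simplified, covariance-based sketch of the representation JSD idea behind the EMID UB terms; the actual `JSD_cov()` implementation in `main.py` may differ (e.g., in centering, normalization, or kernel choices), so treat this only as an illustration.

```python
# Simplified covariance-based sketch of a representation JSD between two
# embedding sets sampled from distributions P and Q.
import torch

def von_neumann_entropy(cov: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # Entropy of a trace-normalized covariance matrix via its eigenvalues.
    eigvals = torch.clamp(torch.linalg.eigvalsh(cov), min=eps)
    eigvals = eigvals / eigvals.sum()
    return -(eigvals * torch.log(eigvals)).sum()

def jsd_cov_sketch(emb_p: torch.Tensor, emb_q: torch.Tensor) -> torch.Tensor:
    # emb_p, emb_q: (n, d) embedding matrices from P and Q.
    def normalized_cov(z):
        z = z - z.mean(dim=0, keepdim=True)
        cov = z.T @ z / z.shape[0]
        return cov / torch.trace(cov)
    c_p, c_q = normalized_cov(emb_p), normalized_cov(emb_q)
    return von_neumann_entropy(0.5 * (c_p + c_q)) - 0.5 * (
        von_neumann_entropy(c_p) + von_neumann_entropy(c_q)
    )
```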
The environment is exactly the same as that of llava-v1.5, plus an additional `datasets==3.5.0` installation.
conda create -n mllmshift-emi python=3.10 -y
conda activate mllmshift-emi
git clone https://github.com/deeplearning-wisc/mllmshift-emi.git
cd mllmshift-emi
pip install --upgrade pip
pip install -e .
pip install datasets==3.5.0
- To test new models on the llava-bench shift benchmarks, you need to prepare model responses for all distribution shift scenarios (28 natural, 35 synthetic).
- You can publicly access our two types of benchmarks through the Hugging Face dataset hub, `llavabench-shift-synthetic-v1` and `llavabench-shift-natural-v1`, which contain the image query, text query, and GT response (GPT-4) for each example (see the loading sketch below).
  - To generate the synthetically perturbed datasets, we adopted defocus blur and frost as visual perturbations, and keyboard typo and synonym replacement as textual perturbations, by leveraging the MM_Robustness codebase.
- Refer to the evaluation documentation in the LLaVA repository for how to obtain model responses by running inference with your MLLMs.
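If you want to pull the benchmarks programmatically with `datasets==3.5.0`, the sketch below shows the general pattern; the `HF_NAMESPACE` placeholder is an assumption, so substitute the actual Hugging Face namespace under which the datasets are published.

```python
# Hedged sketch: loading the shift benchmarks from the Hugging Face hub.
# "HF_NAMESPACE" is a placeholder; replace it with the actual namespace
# under which the datasets are published.
from datasets import load_dataset

synthetic = load_dataset("HF_NAMESPACE/llavabench-shift-synthetic-v1")
natural = load_dataset("HF_NAMESPACE/llavabench-shift-natural-v1")

# Each record is expected to contain an image query, a text query,
# and a GT response (generated with GPT-4).
print(synthetic)
```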
- We provide a pre-trained weight for the CLUB MI estimator at `estimator_ckpt/CLUB_global.pt`, so you don't need to re-train the MI estimator from scratch.
  - In contrast to the setup in our paper, this estimator was trained on a pooled dataset of the synthetic and natural shift datasets with >10K samples, whereas we previously used two separate MI estimators for synthetic and natural shifts.
    - As a result, the replication results may differ slightly from the numbers in the paper.
  - [CAUTION!] If your downstream tasks are significantly different from the llava-bench family of datasets, you may need to retrain the estimator (a CLUB-style sketch follows after this list).
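For users who need to retrain the estimator, here is a minimal sketch of a CLUB-style variational MI estimator with a Gaussian variational network q(y|x); the architecture, dimensions, and training loop in this repository may differ, so treat this as an illustration rather than the exact model behind `CLUB_global.pt`.

```python
# Minimal CLUB-style MI estimator sketch (Gaussian variational network q(y|x)).
# Dimensions and training details are illustrative, not the repository's exact setup.
import torch
import torch.nn as nn

class CLUBSketch(nn.Module):
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 512):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim), nn.Tanh())

    def loglik(self, x, y):
        # Variational log-likelihood; maximize this to train q(y|x).
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_est(self, x, y):
        # CLUB estimate: positive-pair log-likelihood minus negative-pair log-likelihood.
        mu, logvar = self.mu(x), self.logvar(x)
        pos = -((y - mu) ** 2) / logvar.exp()
        neg = -((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2) / logvar.exp().unsqueeze(1)
        return (pos.sum(dim=1) - neg.sum(dim=2).mean(dim=1)).mean() / 2.0

# Training maximizes loglik(x, y) over embedding pairs; at evaluation time,
# mi_est(x, y) provides the MI estimate used inside EMI.
```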
- For easy reproduction, we also provide the responses generated by the `llava-v1.5-13b` and `llava-v1.6-vicuna-13b` models under the path `data/{DATA_SPLIT_NAME}-{MODEL_NAME}.jsonl`.
unzip data.zip
python main.py --model_name llava-v1.5-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.5-13b --shift_type NATURAL
python main.py --model_name llava-v1.6-vicuna-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.6-vicuna-13b --shift_type NATURAL
- After running the above programs, you will find organized results at `results/*.json`.
If this repository was useful for your work, please consider citing our paper!
@inproceedings{oh2025understanding,
title={Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach},
author={Oh, Changdae and Fang, Zhen and Im, Shawn and Du, Xuefeng and Li, Yixuan},
booktitle={International Conference on Machine Learning},
year={2025},
}
- We appreciate the amazing work and fully open codebase from the LLaVA authors, which enabled us to initiate our project.
- We are also sincerely thankful to the authors of CLUB and RepresentationJSD, whose work allowed us to build a reliable estimation framework for mutual information and the Jensen-Shannon divergence.
- We thank the authors of the MM_Robustness repository, which we used to construct our synthetic shift benchmarks.