Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach (ICML'25)
by Changdae Oh¹, Zhen Fang², Shawn Im¹, Xuefeng Du¹, and Yixuan Li¹.
¹University of Wisconsin–Madison, ²University of Technology Sydney
This repository highlights our proposals, along with code and instructions to reproduce our experiments and to support broader usage.
- We presented effective mutual information (EMI), $\text{EMI}(P_{XY};P_{\theta}) := I(P_{X}\otimes P_{\theta}) - I(P_{XY})$, as a new theory-grounded metric to assess the quality of an MLLM's outputs given input queries (a conceptual sketch follows after this list).
- Our theoretical analysis reveals the connection between EMI and LLM-judge-based pairwise preference scores, such as the relative preference score or win rate.
- Based on EMI, we proposed effective mutual information difference (EMID), $\text{EMID}(P_{XY},Q_{XY};P_{\theta}) := \text{EMI}(P_{XY};P_{\theta}) - \text{EMI}(Q_{XY};P_{\theta})$, as an information-theoretic measure of MLLM robustness under distribution shifts.
- We then provided a theoretical upper bound on EMID, constructed from the $D_{\rm JS}(P_{X_v}||Q_{X_v})$, $D_{\rm JS}(P_{X_t}||Q_{X_t})$, $D_{\rm JS}(P_{Y_{\theta}}||P_{Y})$, and $D_{\rm JS}(Q_{Y_{\theta}}||Q_{Y})$ terms, to characterize the performance gap of an MLLM under distribution shifts.
- On 61 types of distribution shifts, we validated that empirical EMI estimates correlate strongly with relative preference scores, and that EMID upper bound estimates consistently correlate with EMID estimates.
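As a conceptual illustration of the two definitions above (not the repository's actual API), the sketch below treats a mutual-information estimator as a black box; the function names and batch format are hypothetical.

```python
# Conceptual sketch of EMI and EMID, assuming a black-box MI estimator
# mi(x, y) that returns an estimate of I(X; Y) from paired embeddings.
# Function names and the (x, y_model, y_gt) batch format are hypothetical.

def emi(mi, x, y_model, y_gt):
    # EMI(P_XY; P_theta) = I(P_X ⊗ P_theta) - I(P_XY):
    # MI between inputs and model responses minus MI between inputs and GT responses.
    return mi(x, y_model) - mi(x, y_gt)

def emid(mi, id_batch, ood_batch):
    # EMID(P, Q; P_theta) = EMI on distribution P minus EMI on distribution Q.
    x_p, ym_p, yg_p = id_batch
    x_q, ym_q, yg_q = ood_batch
    return emi(mi, x_p, ym_p, yg_p) - emi(mi, x_q, ym_q, yg_q)
```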
Our project was built on top of the LLaVA codebase, and here we only provide the pipeline for EMI, EMID, and upper bound computations; please refer to the LLaVA paper and repository for details on MLLM training and inference.
- Basic information
  - EMI consumes an `(image_query: PILImage, text_query: str, model_response: str, GT_response: str)` tuple as input to assess the quality of a model response.
  - EMID and its upper bound (UB) consume a pair of such tuples from two different data distributions to measure the robustness of model response quality across input distributions.
  - We compute all of the above quantities on top of embeddings from pre-trained encoder models such as CLIP-ViT and RoBERTa, to bypass non-trivial MI modeling on the raw input space (see the embedding sketch below).
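For the embedding step mentioned above, the following is a minimal sketch assuming Hugging Face transformers with a CLIP-ViT checkpoint (`openai/clip-vit-large-patch14`) and `roberta-large`; these checkpoints and the pooling choice are illustrative assumptions, not necessarily the exact encoders used in our experiments.

```python
# Minimal embedding sketch for the (image_query, text_query, model_response, GT_response)
# tuple. Checkpoints and pooling choices are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor, RobertaModel, RobertaTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
roberta = RobertaModel.from_pretrained("roberta-large").eval()
tok = RobertaTokenizer.from_pretrained("roberta-large")

@torch.no_grad()
def embed_tuple(image_query, text_query, model_response, gt_response):
    # Visual embedding X_v from CLIP-ViT.
    x_v = clip.get_image_features(**clip_proc(images=image_query, return_tensors="pt"))
    # Text embeddings X_t, Y_theta, Y_gt from RoBERTa (<s> token pooling).
    def text_emb(s):
        out = roberta(**tok(s, return_tensors="pt", truncation=True))
        return out.last_hidden_state[:, 0]
    return x_v, text_emb(text_query), text_emb(model_response), text_emb(gt_response)
```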
- EMID and its upper bound estimation on a pair of datasets, e.g., one in-distribution (ID) and one out-of-distribution (OOD)
  - Run inference on all datasets of interest to gather your model's responses $Y_{\theta}$ to the input queries.
  - Get embedding vectors $\tilde{X}_{v}$, $\tilde{X}_{t}$, $\tilde{Y}_{\theta}$, and $\tilde{Y}_{gt}$ for the `(image_query, text_query, model_response, GT_response)` tuples with pre-trained vision and text encoders. If you don't have ground-truth (GT) responses for a dataset, obtain them by querying a reference model, e.g., GPT-4o.
  - (Optional) Construct an embedding-pair dataset $\{(\tilde{X},\tilde{Y})\}$ and train a neural MI estimator on it.
  - You can compute EMI and EMID by feeding embedding tuples into the (pre-)trained MI estimator.
  - You can also compute the EMID UB on top of the embedding tuples with the RJSD estimator (see the `JSD_cov()` function in `main.py`; a simplified sketch follows after this list).
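The following is a simplified, covariance-based sketch of the representation JSD idea behind the EMID UB terms; the actual `JSD_cov()` implementation in `main.py` may differ (e.g., in centering, normalization, or kernel choices), so treat this only as an illustration.

```python
# Simplified covariance-based sketch of a representation JSD between two
# embedding sets sampled from distributions P and Q.
import torch

def von_neumann_entropy(cov: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # Entropy of a trace-normalized covariance matrix via its eigenvalues.
    eigvals = torch.clamp(torch.linalg.eigvalsh(cov), min=eps)
    eigvals = eigvals / eigvals.sum()
    return -(eigvals * torch.log(eigvals)).sum()

def jsd_cov_sketch(emb_p: torch.Tensor, emb_q: torch.Tensor) -> torch.Tensor:
    # emb_p, emb_q: (n, d) embedding matrices from P and Q.
    def normalized_cov(z):
        z = z - z.mean(dim=0, keepdim=True)
        cov = z.T @ z / z.shape[0]
        return cov / torch.trace(cov)
    c_p, c_q = normalized_cov(emb_p), normalized_cov(emb_q)
    return von_neumann_entropy(0.5 * (c_p + c_q)) - 0.5 * (
        von_neumann_entropy(c_p) + von_neumann_entropy(c_q)
    )
```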
The environment is exactly the same as that of llava-v1.5, plus an additional `datasets==3.5.0` installation.
conda create -n mllmshift-emi python=3.10 -y
conda activate mllmshift-emi
git clone https://github.com/deeplearning-wisc/mllmshift-emi.git
cd mllmshift-emi
pip install --upgrade pip
pip install -e .
pip install datasets==3.5.0
- To test new models on the llava-bench shift benchmarks, you need to prepare model responses for all distribution shift scenarios (28 natural, 35 synthetic).
- You can publicly access our two types of benchmarks through the Hugging Face dataset hub, `llavabench-shift-synthetic-v1` and `llavabench-shift-natural-v1`, which contain the image query, text query, and GT response (GPT-4) for each example (see the loading sketch below).
  - To generate the synthetically perturbed datasets, we adopted defocus blur and frost as visual perturbations, and keyboard typo and synonym replacement as textual perturbations, by leveraging the MM_Robustness codebase.
- Refer to the evaluation documentation in the LLaVA repository for how to obtain model responses by running inference with your MLLMs.
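If you want to pull the benchmarks programmatically with `datasets==3.5.0`, the sketch below shows the general pattern; the `HF_NAMESPACE` placeholder is an assumption, so substitute the actual Hugging Face namespace under which the datasets are published.

```python
# Hedged sketch: loading the shift benchmarks from the Hugging Face hub.
# "HF_NAMESPACE" is a placeholder; replace it with the actual namespace
# under which the datasets are published.
from datasets import load_dataset

synthetic = load_dataset("HF_NAMESPACE/llavabench-shift-synthetic-v1")
natural = load_dataset("HF_NAMESPACE/llavabench-shift-natural-v1")

# Each record is expected to contain an image query, a text query,
# and a GT response (generated with GPT-4).
print(synthetic)
```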
- We provide a pre-trained weight for the CLUB MI estimator at `estimator_ckpt/CLUB_global.pt`, so you don't need to re-train the MI estimator from scratch.
  - In contrast to the setup in our paper, this estimator was trained on a pooled dataset of the synthetic and natural shift datasets with >10K samples, whereas we previously used two separate MI estimators for synthetic and natural shifts.
    - As a result, the replication results may differ slightly from the numbers in the paper.
  - [CAUTION!] If your downstream tasks are significantly different from the llava-bench family of datasets, you may need to retrain the estimator (a CLUB-style sketch follows after this list).
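For users who need to retrain the estimator, here is a minimal sketch of a CLUB-style variational MI estimator with a Gaussian variational network q(y|x); the architecture, dimensions, and training loop in this repository may differ, so treat this as an illustration rather than the exact model behind `CLUB_global.pt`.

```python
# Minimal CLUB-style MI estimator sketch (Gaussian variational network q(y|x)).
# Dimensions and training details are illustrative, not the repository's exact setup.
import torch
import torch.nn as nn

class CLUBSketch(nn.Module):
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 512):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim), nn.Tanh())

    def loglik(self, x, y):
        # Variational log-likelihood; maximize this to train q(y|x).
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_est(self, x, y):
        # CLUB estimate: positive-pair log-likelihood minus negative-pair log-likelihood.
        mu, logvar = self.mu(x), self.logvar(x)
        pos = -((y - mu) ** 2) / logvar.exp()
        neg = -((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2) / logvar.exp().unsqueeze(1)
        return (pos.sum(dim=1) - neg.sum(dim=2).mean(dim=1)).mean() / 2.0

# Training maximizes loglik(x, y) over embedding pairs; at evaluation time,
# mi_est(x, y) provides the MI estimate used inside EMI.
```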
- For easy reproduction, we also provide the responses generated by the `llava-v1.5-13b` and `llava-v1.6-vicuna-13b` models under the path `data/{DATA_SPLIT_NAME}-{MODEL_NAME}.jsonl`.
unzip data.zip
python main.py --model_name llava-v1.5-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.5-13b --shift_type NATURAL
python main.py --model_name llava-v1.6-vicuna-13b --shift_type SYNTHETIC
python main.py --model_name llava-v1.6-vicuna-13b --shift_type NATURAL
- After running the above programs, you will find organized results at `results/*.json`.
If this repository was useful for your work, please consider citing our paper!
@inproceedings{oh2025understanding,
title={Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach},
author={Oh, Changdae and Fang, Zhen and Im, Shawn and Du, Xuefeng and Li, Yixuan},
booktitle={International Conference on Machine Learning},
year={2025},
}
- We appreciate the amazing work and fully open codebase from the LLaVA authors, which enabled us to initiate our project.
- We are also sincerely thankful to the authors of CLUB and RepresentationJSD, whose work allowed us to build a reliable estimation framework for mutual information and the Jensen-Shannon divergence.
- We thank the authors of the MM_Robustness repository, which we used to construct our synthetic shift benchmarks.