Commit e5d7b81

Yinhong Liu (williamLyh) authored and committed
update readme
1 parent 8e05635 commit e5d7b81

File tree

5 files changed (+30, -21 lines)


README.md

Lines changed: 29 additions & 21 deletions
@@ -4,7 +4,7 @@

**Updates: Our work has been accepted by EMNLP 2025 🎉**

-This is the official repository for the **MDSEval** benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, **Mutually Exclusive Key Information (MEKI)**. MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.
+This is the official repository for the [**MDSEval**](https://arxiv.org/abs/2510.01659) benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, **Mutually Exclusive Key Information (MEKI)**. MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.

⚠️ **Note:** MDSEval is an **evaluation benchmark**. The data provided here should **not** be used for training NLP models.

@@ -23,14 +23,7 @@ To ensure data quality and diversity, we introduce a novel filtering framework,

Our contributions include:
- The first formalization of key evaluation dimensions specific to MDS
- A high-quality benchmark dataset for robust evaluation
-- A comprehensive assessment of state-of-the-art evaluation methods, showing their limitations in distinguishing between summaries from advanced MLLMs and their vulnerability to various biases
-
-## Dependencies
----
-Besides the `requirements.txt`, we additionally depend on:
-* The [google-research](https://github.com/google-research/google-research) repository, with install command in `prepare_dialog_data.sh`
-* The external images provided in `MDSEval_annotations.json`, with download script in `prepare_image_data.sh`
-* The model checkpoint [ViT-H-14-378-quickgelu](https://huggingface.co/immich-app/ViT-H-14-378-quickgelu__dfn5b) loaded by `meki.py`
+- A comprehensive assessment of state-of-the-art evaluation methods, showing their limitations in distinguishing between summaries from advanced MLLMs and their vulnerability to various biases

## Download the Dialogue and Image Data
---
@@ -88,24 +81,34 @@ To ensure the dataset is sufficiently challenging for multimodal summarization,

We embed both the image and the textual dialogue into a **shared semantic space**, e.g. using the CLIP model, and denote the resulting vectors $I \in \mathbb{R}^N$ and $T \in \mathbb{R}^N$, where $N$ is the embedding dimension. Since CLIP embeddings are unit-normalized, we maintain this normalization for consistency.
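
For concreteness, here is a minimal sketch of how the two embeddings might be obtained with the `open_clip` package and the ViT-H-14-378-quickgelu checkpoint that `meki.py` loads (see the removed Dependencies section above); the file path, dialogue string, and pretrained tag are placeholders and assumptions rather than the repository's exact code:

```python
import torch
import open_clip
from PIL import Image

# Load the CLIP checkpoint named in the (removed) Dependencies section.
# The pretrained tag "dfn5b" is an assumption based on the linked HF repo name.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14-378-quickgelu", pretrained="dfn5b"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14-378-quickgelu")
model.eval()

image = preprocess(Image.open("example_photo.jpg")).unsqueeze(0)  # placeholder image path
dialogue = "A: I finally finished the mural! B: Send a picture, I want to see it."  # placeholder text
tokens = tokenizer([dialogue])  # note: CLIP truncates text beyond its context length

with torch.no_grad():
    I = model.encode_image(image)  # shape (1, N)
    T = model.encode_text(tokens)  # shape (1, N)

# Keep the unit normalization assumed throughout the MEKI derivation below.
I = I / I.norm(dim=-1, keepdim=True)
T = T / T.norm(dim=-1, keepdim=True)
```
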
To measure **Exclusive Information (EI)** in $I$ that is not present in $T$, we compute the orthogonal component of $I$ relative to $T$:
-\[
+<!-- \[
% \operatorname{EI}(I|T) =
I_T^\perp = I - \operatorname{Proj}_T(I) = I - \frac{\langle I, T\rangle}{\langle T, T\rangle} T,
-\]
+\] -->
+
+<img src="logo/equ1.png" width="400">
+
where $\langle \cdot , \cdot \rangle$ denotes the dot product.
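
As a quick numerical illustration of this step, a sketch with toy vectors standing in for the CLIP embeddings (not code from the repository):

```python
import numpy as np

def orthogonal_component(I: np.ndarray, T: np.ndarray) -> np.ndarray:
    """I_T_perp = I - Proj_T(I): the part of I carrying no information along T."""
    return I - (np.dot(I, T) / np.dot(T, T)) * T

# Toy stand-ins for unit-normalized embeddings (a real CLIP dimension N is much larger).
I = np.array([0.6, 0.8, 0.0])
T = np.array([1.0, 0.0, 0.0])

I_T_perp = orthogonal_component(I, T)
print(I_T_perp)             # [0.  0.8 0. ]
print(np.dot(I_T_perp, T))  # 0.0 -- orthogonal to T by construction
```
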
Next, to identify **Exclusive Key Information (EKI)** — crucial content uniquely conveyed by one modality — we first generate a pseudo-summary $S$, which extracts essential dialogue and image details. This serves as a reference proxy rather than a precise summary, helping distinguish key information. We embed and normalize $S$ in the CLIP space and compute:
-\[
+<!-- \[
\operatorname{EKI}(I|T; S) =
% \| \operatorname{Proj}_S(I_T^\perp) \| =
\left\| \frac{\langle I_T^\perp, S\rangle}{\langle S, S\rangle} S \right\|
-\]
+\] -->
+
+<img src="logo/equ2.png" width="350">
+
which quantifies the extent of exclusive image-based key information. Similarly, we compute $\operatorname{EKI}(T|I; S)$ for textual exclusivity.

Finally, the MEKI score aggregates both components:
-\[
+<!-- \[
\operatorname{MEKI}(I, T; S) = \lambda \operatorname{EKI}(I \mid T; S) + (1-\lambda)\operatorname{EKI}(T \mid I; S)
-\]
+\] -->
+
+<img src="logo/equ3.png" width="600">
+
where $\lambda=0.3$, chosen to balance the typically higher magnitude of the exclusivity term in text-based information, ensuring that the average magnitudes of both terms are approximately equal.
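
Putting the pieces together, here is a self-contained sketch of the full score as defined by the equations above (again with toy vectors in place of the CLIP embeddings of the image $I$, dialogue $T$, and pseudo-summary $S$; it mirrors the formulas rather than the exact implementation in `meki.py`):

```python
import numpy as np

LAMBDA = 0.3  # value used in the paper; weights the image-side vs. text-side exclusivity terms

def eki(a: np.ndarray, b: np.ndarray, s: np.ndarray) -> float:
    """EKI(a | b; s): norm of the projection onto s of the component of a orthogonal to b."""
    a_perp = a - (np.dot(a, b) / np.dot(b, b)) * b   # a_b^perp
    proj = (np.dot(a_perp, s) / np.dot(s, s)) * s    # Proj_s(a_b^perp)
    return float(np.linalg.norm(proj))

def meki(I: np.ndarray, T: np.ndarray, S: np.ndarray, lam: float = LAMBDA) -> float:
    """MEKI(I, T; S) = lam * EKI(I|T; S) + (1 - lam) * EKI(T|I; S)."""
    return lam * eki(I, T, S) + (1.0 - lam) * eki(T, I, S)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Toy unit-normalized embeddings; in practice these come from CLIP.
rng = np.random.default_rng(0)
I = normalize(rng.normal(size=8))
T = normalize(rng.normal(size=8))
S = normalize(I + T + 0.1 * rng.normal(size=8))  # pseudo-summary covering both modalities

print(round(meki(I, T, S), 4))
```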

@@ -128,10 +131,15 @@ Accordingly, we release MDSEval under the Apache 2.0 License.

---
If you found the benchmark useful, please consider citing our work.

-## Other
----
-This is an intern project which has ended. Therefore, there will be no regular updates for this repository.

+```
+@misc{liu2025mdsevalmetaevaluationbenchmarkmultimodal,
+      title={MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization},
+      author={Yinhong Liu and Jianfeng He and Hang Su and Ruixue Lian and Yi Nian and Jake Vincent and Srikanth Vishnubhotla and Robinson Piramuthu and Saab Mansour},
+      year={2025},
+      eprint={2510.01659},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2510.01659},
+}
+```

logo/equ1.png (20 KB)

logo/equ2.png (20.5 KB)

logo/equ3.png (25.8 KB)

test.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+test
