**Updates: Our work has been accepted by EMNLP 2025 🎉**

This is the official repository for the [**MDSEval**](https://arxiv.org/abs/2510.01659) benchmark. It includes all human annotations, benchmark data, and the implementation of our newly proposed data filtering framework, **Mutually Exclusive Key Information (MEKI)**. MEKI is designed to filter high-quality multimodal data by ensuring that each modality contributes unique information.

⚠️ **Note:** MDSEval is an **evaluation benchmark**. The data provided here should **not** be used for training NLP models.

Our contributions include:
- The first formalization of key evaluation dimensions specific to MDS
- A high-quality benchmark dataset for robust evaluation
- A comprehensive assessment of state-of-the-art evaluation methods, revealing their limited ability to distinguish summaries produced by advanced MLLMs and their vulnerability to various biases
## Dependencies
---
In addition to `requirements.txt`, this repository also depends on:
* The [google-research](https://github.com/google-research/google-research) repository, installed via the command in `prepare_dialog_data.sh`
* The external images referenced in `MDSEval_annotations.json`, downloaded via the script in `prepare_image_data.sh`
* The model checkpoint [ViT-H-14-378-quickgelu](https://huggingface.co/immich-app/ViT-H-14-378-quickgelu__dfn5b), loaded by `meki.py`
## Download the Dialogue and Image Data
---
We embed both the image and the textual dialogue into a **shared semantic space**, e.g., using the CLIP model, and denote the resulting vectors as $I \in \mathbb{R}^N$ and $T \in \mathbb{R}^N$, where $N$ is the embedding dimension. Since CLIP embeddings are unit-normalized, we maintain this normalization for consistency.
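As a rough sketch of this step (not the repository's exact code), assuming the `open_clip` library and the DFN5B checkpoint listed under Dependencies, with a placeholder image path and dialogue:

```python
import torch
import open_clip
from PIL import Image

# Load the CLIP model and its preprocessing transforms (checkpoint name per Dependencies).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14-378-quickgelu", pretrained="dfn5b"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14-378-quickgelu")
model.eval()

with torch.no_grad():
    # Image embedding I, kept unit-normalized as in the CLIP convention.
    image = preprocess(Image.open("example_image.jpg")).unsqueeze(0)
    I = model.encode_image(image).squeeze(0)
    I = I / I.norm()

    # Dialogue embedding T (note that CLIP text encoders truncate long inputs).
    dialogue = "A: Look at the photo I took yesterday! B: Wow, where was that taken?"
    T = model.encode_text(tokenizer([dialogue])).squeeze(0)
    T = T / T.norm()
```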
To measure **Exclusive Information (EI)** in $I$ that is not present in $T$, we compute the orthogonal component of $I$ relative to $T$:

$$
I_T^\perp = I - \operatorname{Proj}_T(I) = I - \frac{\langle I, T\rangle}{\langle T, T\rangle} T,
$$

where $\langle \cdot , \cdot \rangle$ denotes the dot product.
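As a minimal sketch (illustrative names, not necessarily those used in `meki.py`), with `I` and `T` the unit-normalized embeddings from above:

```python
import torch

def exclusive_component(I: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Return I_T_perp = I - Proj_T(I), the part of I not explained by T."""
    # For unit-normalized T, <T, T> equals 1; the general form is kept for clarity.
    return I - (torch.dot(I, T) / torch.dot(T, T)) * T
```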
Next, to identify **Exclusive Key Information (EKI)** — crucial content uniquely conveyed by one modality — we first generate a pseudo-summary $S$, which extracts essential dialogue and image details. This serves as a reference proxy rather than a precise summary, helping distinguish key information. We embed and normalize $S$ in the CLIP space and compute:

$$
\operatorname{EKI}(I|T; S) = \left\| \operatorname{Proj}_S(I_T^\perp) \right\| = \left\| \frac{\langle I_T^\perp, S\rangle}{\langle S, S\rangle} S \right\|
$$

which quantifies the extent of exclusive image-based key information. Similarly, we compute $\operatorname{EKI}(T|I; S)$ for textual exclusivity.
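A corresponding sketch under the same assumptions, with `S` the unit-normalized embedding of the pseudo-summary:

```python
import torch

def eki(I: torch.Tensor, T: torch.Tensor, S: torch.Tensor) -> float:
    """EKI(I|T; S): norm of the projection of I_T_perp onto S."""
    I_T_perp = I - (torch.dot(I, T) / torch.dot(T, T)) * T  # exclusive information in I
    proj_on_S = (torch.dot(I_T_perp, S) / torch.dot(S, S)) * S
    return proj_on_S.norm().item()

# The textual counterpart is the symmetric call: eki(T, I, S).
```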
Finally, the MEKI score aggregates both components, where $\lambda=0.3$ is chosen to offset the typically larger magnitude of the text-based exclusivity term, so that the two terms contribute with approximately equal average magnitude.

---
If you find the benchmark useful, please consider citing our work.
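```bibtex
@article{liu2025mdseval,
  title={MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization},
  author={Yinhong Liu and Jianfeng He and Hang Su and Ruixue Lian and Yi Nian and Jake Vincent and Srikanth Vishnubhotla and Robinson Piramuthu and Saab Mansour},
  journal={arXiv preprint arXiv:2510.01659},
  year={2025}
}
```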
## Other
---
This was an internship project that has since concluded, so there will be no regular updates to this repository.