- [2025.03.25] Evaluation Codes have been released.
- [2025.02.27] Our paper has been accepted by CVPR 2025! 🎉
- [2025.01.15] We are excited to share that our evaluation datasets, Charades-CON and ActivityNet-CON, are now available on Hugging Face! 🎉 Additionally, the training annotations for VTune have also been released.
- [2025.01.14] We have released four checkpoints trained with VTune: VideoLLaMA-7B-Charades-VTune, VideoLLaMA-7B-ActivityNet-VTune, TimeChat-7B-Charades-VTune, and TimeChat-7B-ActivityNet-VTune. We have also released checkpoints trained with naive fine-tuning: VideoLLaMA-7B-Charades-FT, VideoLLaMA-7B-ActivityNet-FT, and TimeChat-7B-ActivityNet-FT.
- [2024.11.20] Our paper has been released on arXiv.
- We study the model’s consistency in temporal comprehension by assessing whether its responses align with the initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task involves identifying timestamps in a video that correspond to language queries.
You can download the complete annotations for consistency evaluation from Hugging Face. The source videos are available via the following links:
Before starting the evaluation, make sure you have prepared the annotations and videos, and check the configuration of the Video-LLMs. Install the necessary dependencies for your model using conda and pip. Additionally, you may run `utils/shift_video.py` with the appropriate paths to prepare shifted videos.
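If it helps to see what the shifted-video probe involves before running the script, below is a minimal sketch (not the repo's implementation) of one plausible way to build a shifted copy of a video with ffmpeg: the first few seconds are moved to the end, so ground-truth moments land at new timestamps. The paths, the shift length, and the circular-shift strategy are all assumptions; `utils/shift_video.py` is the authoritative script and may differ.

```python
# Illustrative sketch only (NOT utils/shift_video.py): create a temporally
# shifted copy of a video by moving its first `shift` seconds to the end.
# Requires the ffmpeg binary to be installed and on PATH.
import subprocess

def circular_shift(src: str, dst: str, shift: float) -> None:
    """Write a copy of `src` whose first `shift` seconds are appended at the end."""
    filt = (
        "[0:v]split[v0][v1];"
        f"[v0]trim=start={shift},setpts=PTS-STARTPTS[tail];"
        f"[v1]trim=end={shift},setpts=PTS-STARTPTS[head];"
        "[tail][head]concat=n=2:v=1:a=0[out]"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter_complex", filt, "-map", "[out]", dst],
        check=True,
    )

# Hypothetical paths; point these at your own video directories.
circular_shift("videos/example.mp4", "videos_shifted/example.mp4", shift=10.0)
```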
Here, we provide an example with the model TimeChat. We will include additional baseline models in the future.
To run the evaluation, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --task consistency
```
`dset_name` refers to the test dataset, which can be either `charades` or `activitynet`. `task` refers to the evaluation task: either `consistency` or `grounding`. If set to `grounding`, the evaluation will be performed on the original test set.
You can also pass the `--debug` flag to verify your configuration settings before running the actual evaluation.
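For instance, a configuration check for the grounding task on Charades could look like the command below; it simply combines the flags documented above with `--debug` (we assume `--debug` accepts the same remaining arguments as a normal run):

```bash
python run.py --model_type TimeChat --dset_name charades --task grounding --debug
```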
Once the evaluation is complete, the performance will be reported in `consistency_eval_results.json`, and you can check the model's outputs in `consistency_predictions.jsonl`.
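If you want to inspect these outputs programmatically, the snippet below is a minimal sketch; the exact field names written by the evaluation script are not documented here, so it simply prints whatever the files contain.

```python
# Minimal inspection of the evaluation outputs; the field names are whatever
# the evaluation script writes, so we just pretty-print the contents.
import json

with open("consistency_eval_results.json") as f:
    print(json.dumps(json.load(f), indent=2))  # aggregated metrics

with open("consistency_predictions.jsonl") as f:
    print(json.loads(next(f)))  # first per-sample model output
```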
For training, please download the training annotations for each dataset from Hugging Face.
The previously uploaded VTune dataset file `charades_for_VTune.json` partially includes videos from the Charades-CON test split.
The updated files `charades_train_v2.json` and `charades_for_VTune_v2.json` exclude these overlapping videos.
The corresponding hyperparameters should follow the table below. Note that neither dataset includes test videos from the original Charades-STA test split.
We apologize for any inconvenience caused.
| Dataset Name | iters_per_epochs | warmup_steps |
|---|---|---|
| charades_for_VTune | 24,811 | 14,916 |
| charades_for_VTune_v2 | 22,311 | 13,386 |
The performance of TimeChat trained with `charades_for_VTune_v2`:
| Method | Ground | R-Ground | S-Ground | H-Verify | C-Verify |
|---|---|---|---|---|---|
| SFT | 47.2 | 43.4 (91.8) | 15.0 (31.9) | 24.3 (51.5) | 24.0 (50.9) |
| VTune | 52.0 | 47.4 (91.2) | 23.5 (45.2) | 31.5 (60.5) | 27.5 (52.9) |
We provide the checkpoints for each dataset using the links below:
Then, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --fine_tuned --task consistency
```
In the example command above, the `--fine_tuned` option automatically switches the checkpoint path `ckpt` to `activitynet_ckpt` in `timechat/eval_configs/timechat.yaml`.
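Conceptually, the switch just points the evaluation config at the fine-tuned weights. The sketch below is only an illustration of that idea and assumes a simplified flat layout for `timechat.yaml` with `ckpt` and `activitynet_ckpt` entries (key names taken from the sentence above); `run.py` performs the actual switching when `--fine_tuned` is passed.

```python
# Illustration only: what the --fine_tuned option amounts to. Assumes a
# simplified flat timechat.yaml containing `ckpt` and `activitynet_ckpt`;
# the real config may nest these keys, and run.py handles this for you.
import yaml  # pip install pyyaml

with open("timechat/eval_configs/timechat.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["ckpt"] = cfg["activitynet_ckpt"]  # evaluate the ActivityNet fine-tuned checkpoint
print("Evaluating checkpoint:", cfg["ckpt"])
```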
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{jung2025consistency,
title={On the consistency of video large language models in temporal comprehension},
author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={13713--13722},
year={2025}
}
```

We appreciate the following awesome Video-LLMs:
