- [2025.03.25] Evaluation Codes have been released.
- [2025.02.27] Our paper has been accepted by CVPR 2025! 🎉
- [2025.01.15] We are excited to share that our evaluation datasets, Charades-CON and ActivityNet-CON, are now available on Hugging Face! 🎉 Additionally, the training annotations for VTune have also been released.
- [2025.01.14] We have released four checkpoints trained with VTune: VideoLLaMA-7B-Charades-VTune, VideoLLaMA-7B-ActivityNet-VTune, TimeChat-7B-Charades-VTune, and TimeChat-7B-ActivityNet-VTune. We have also released checkpoints trained with naive fine-tuning: VideoLLaMA-7B-Charades-FT, VideoLLaMA-7B-ActivityNet-FT, and TimeChat-7B-ActivityNet-FT.
- [2024.11.20] Our paper has been released on arXiv.
- We study the model’s consistency in temporal comprehension by assessing whether its responses align with the initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task involves identifying timestamps in a video that correspond to language queries.
You can download the complete annotations for consistency evaluation from Hugging Face. The source videos are available via the following links:
Before starting the evaluation, make sure you have prepared the annotations and videos, and check the configuration of the Video-LLMs. Install the necessary dependencies for your model using conda and pip. Additionally, you may run `utils/shift_video.py` with the appropriate paths to prepare shifted videos.
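If it helps to see what the shifted-video probe involves before running the script, below is a minimal sketch (not the repo's implementation) of one plausible way to build a shifted copy of a video with ffmpeg: the first few seconds are moved to the end, so ground-truth moments land at new timestamps. The paths, the shift length, and the circular-shift strategy are all assumptions; `utils/shift_video.py` is the authoritative script and may differ.

```python
# Illustrative sketch only (NOT utils/shift_video.py): create a temporally
# shifted copy of a video by moving its first `shift` seconds to the end.
# Requires the ffmpeg binary to be installed and on PATH.
import subprocess

def circular_shift(src: str, dst: str, shift: float) -> None:
    """Write a copy of `src` whose first `shift` seconds are appended at the end."""
    filt = (
        "[0:v]split[v0][v1];"
        f"[v0]trim=start={shift},setpts=PTS-STARTPTS[tail];"
        f"[v1]trim=end={shift},setpts=PTS-STARTPTS[head];"
        "[tail][head]concat=n=2:v=1:a=0[out]"
    )
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-filter_complex", filt, "-map", "[out]", dst],
        check=True,
    )

# Hypothetical paths; point these at your own video directories.
circular_shift("videos/example.mp4", "videos_shifted/example.mp4", shift=10.0)
```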
Here, we provide an example with the model TimeChat. We will include additional baseline models in the future.
To run the evaluation, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --task consistency
```
`dset_name` refers to the test dataset, which can be either `charades` or `activitynet`. `task` refers to the evaluation task: either `consistency` or `grounding`. If set to `grounding`, the evaluation will be performed on the original test set.
You can also pass the `--debug` flag to verify your configuration settings before running the actual evaluation.
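For instance, a configuration check for the grounding task on Charades could look like the command below; it simply combines the flags documented above with `--debug` (we assume `--debug` accepts the same remaining arguments as a normal run):

```bash
python run.py --model_type TimeChat --dset_name charades --task grounding --debug
```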
Once the evaluation is complete, the performance will be reported in `consistency_eval_results.json`, and you can check the model's outputs in `consistency_predictions.jsonl`.
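If you want to inspect these outputs programmatically, the snippet below is a minimal sketch; the exact field names written by the evaluation script are not documented here, so it simply prints whatever the files contain.

```python
# Minimal inspection of the evaluation outputs; the field names are whatever
# the evaluation script writes, so we just pretty-print the contents.
import json

with open("consistency_eval_results.json") as f:
    print(json.dumps(json.load(f), indent=2))  # aggregated metrics

with open("consistency_predictions.jsonl") as f:
    print(json.loads(next(f)))  # first per-sample model output
```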
For training, please download the training annotations for each dataset from Hugging Face.
The previously uploaded VTune dataset file `charades_for_VTune.json` partially includes videos from the Charades-CON test split.
The updated files `charades_train_v2.json` and `charades_for_VTune_v2.json` exclude these overlapping videos.
The corresponding hyperparameters should follow the table below. Note that neither dataset includes test videos from the original Charades-STA test split.
We apologize for any inconvenience caused.
| Dataset Name | iters_per_epochs | warmup_steps |
|---|---|---|
| charades_for_VTune | 24,811 | 14,916 |
| charades_for_VTune_v2 | 22,311 | 13,386 |
The performance of TimeChat trained with `charades_for_VTune_v2`:
| Method | Ground | R-Ground | S-Ground | H-Verify | C-Verify |
|---|---|---|---|---|---|
| SFT | 47.2 | 43.4 (91.8) | 15.0 (31.9) | 24.3 (51.5) | 24.0 (50.9) |
| VTune | 52.0 | 47.4 (91.2) | 23.5 (45.2) | 31.5 (60.5) | 27.5 (52.9) |
We provide the checkpoints for each dataset using the links below:
Then, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --fine_tuned --task consistency
```
In the example command above, the `--fine_tuned` option automatically switches the checkpoint path `ckpt` to `activitynet_ckpt` in `timechat/eval_configs/timechat.yaml`.
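Conceptually, the switch just points the evaluation config at the fine-tuned weights. The sketch below is only an illustration of that idea and assumes a simplified flat layout for `timechat.yaml` with `ckpt` and `activitynet_ckpt` entries (key names taken from the sentence above); `run.py` performs the actual switching when `--fine_tuned` is passed.

```python
# Illustration only: what the --fine_tuned option amounts to. Assumes a
# simplified flat timechat.yaml containing `ckpt` and `activitynet_ckpt`;
# the real config may nest these keys, and run.py handles this for you.
import yaml  # pip install pyyaml

with open("timechat/eval_configs/timechat.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["ckpt"] = cfg["activitynet_ckpt"]  # evaluate the ActivityNet fine-tuned checkpoint
print("Evaluating checkpoint:", cfg["ckpt"])
```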
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{jung2025consistency,
title={On the consistency of video large language models in temporal comprehension},
author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={13713--13722},
year={2025}
}
```

We appreciate the following awesome Video-LLMs:
