VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu
{"title":"VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?","authors":"Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu","doi":"10.1109/CVPR52734.2025.00794","DOIUrl":null,"url":null,"abstract":"<p><p>The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multi-modal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. Our benchmark is publicly available at https://yunlong10.github.io/VidComposition/.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"8490-8500"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413207/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52734.2025.00794","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/8/13 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension and lack a detailed assessment of their ability to understand video compositions: the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions covering compositional aspects such as camera movement, camera angle, shot size, narrative structure, and character actions and emotions. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities, highlighting the limitations of current MLLMs in understanding complex, compiled video compositions and offering insights into areas for further improvement. Our benchmark is publicly available at https://yunlong10.github.io/VidComposition/.
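To make the task format concrete, below is a minimal sketch of how a multiple-choice video benchmark of this kind is typically evaluated. The record schema (`video`, `question`, `options`, `answer`) and the `query_model` stub are illustrative assumptions for this sketch, not the actual VidComposition data format or evaluation harness; consult the project page for the real release.

```python
# Minimal sketch of multiple-choice accuracy evaluation for a video benchmark.
# The record schema and query_model() are hypothetical placeholders, NOT the
# actual VidComposition format or API.

from typing import Callable

# One hypothetical benchmark record: a compiled video clip, a question about
# its composition, candidate options, and the ground-truth option letter.
example = {
    "video": "clips/0001.mp4",  # path to the compiled video clip
    "question": "What camera movement dominates this shot?",
    "options": {"A": "Pan", "B": "Tilt", "C": "Dolly", "D": "Static"},
    "answer": "C",
}

def evaluate(records: list[dict], query_model: Callable[[dict], str]) -> float:
    """Return multiple-choice accuracy: the fraction of questions where the
    model's predicted option letter matches the ground truth."""
    correct = 0
    for rec in records:
        # Normalize the model's free-form reply to a single option letter.
        pred = query_model(rec).strip().upper()[:1]
        correct += pred == rec["answer"]
    return correct / len(records)

if __name__ == "__main__":
    # A trivial stub "model" that always answers "A", for demonstration only.
    acc = evaluate([example], query_model=lambda rec: "A")
    print(f"accuracy: {acc:.2%}")
```

In practice, `query_model` would prompt an MLLM with sampled video frames plus the question and options; reported human-model gaps on such benchmarks come from comparing this accuracy against human annotators answering the same questions.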
