MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

IEEE Transactions on Pattern Analysis and Machine Intelligence (Impact Factor: 18.6)
Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang
{"title":"MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation.","authors":"Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang","doi":"10.1109/TPAMI.2025.3600507","DOIUrl":null,"url":null,"abstract":"<p><p>This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are released at https://henghuiding.github.io/MeViS.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3600507","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, which focuses on segmenting and tracking target objects in videos based on language descriptions of the objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, which can allow the target object to be identified from a single frame; such datasets underemphasize the role of motion in both video and language. To explore the feasibility of using motion expressions and motion-reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across the 4 tasks supported by MeViS: 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate the weaknesses and limitations of existing methods in addressing motion-expression-guided video understanding. We further analyze the challenges and propose LMPM++, an approach for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion-expression-guided video understanding algorithms for complex video scenes. The MeViS dataset and the method's source code are released at https://henghuiding.github.io/MeViS.
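For readers who want a concrete picture of what such annotations pair together, the minimal Python sketch below iterates over (video, expression, referred-object) triples from a JSON annotation file. The layout and field names ("videos", "expressions", "exp", "obj_ids") are illustrative assumptions for a MeViS-style release, not the official schema; consult https://henghuiding.github.io/MeViS for the actual format.

    # Minimal sketch: walk a hypothetical MeViS-style annotation file and
    # yield (video_id, expression_text, referred_object_ids) triples.
    # Field names below are assumptions, not the official MeViS schema.
    import json
    from pathlib import Path

    def iter_expression_samples(annotation_path: str):
        """Yield (video_id, expression_text, object_ids) for every annotation."""
        meta = json.loads(Path(annotation_path).read_text())
        for video_id, video in meta["videos"].items():        # assumed key
            for expr in video["expressions"].values():        # assumed key
                yield video_id, expr["exp"], expr["obj_ids"]  # assumed keys

    if __name__ == "__main__":
        for video_id, text, obj_ids in iter_expression_samples("meta_expressions.json"):
            print(f"{video_id}: '{text}' -> objects {obj_ids}")

An RVOS model would consume each such triple together with the video frames and predict per-frame masks for the referred objects; the key point of MeViS is that the expression describes motion, so the referent generally cannot be resolved from any single frame.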
