Qing Song;Zilong Jia;Wenhe Jia;Wenyi Zhao;Mengjie Hu;Chun Liu
Title: News-MESI: A Dataset for Multimodal News Excerpt Segmentation and Identification
DOI: 10.1109/TETCI.2024.3369866
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 4, pp. 3001-3016
Published: 2024-03-14
Citations: 0
Abstract
In long news videos, the fundamental component is the news excerpt, which consists of many studio and interview shots. Spotting and identifying the correct news excerpt within such a long, complex video is challenging: beyond the inherent temporal semantics and the complex interactions among generic events, the varied richness of semantics in the text and visual modalities further complicates matters. In this paper, we examine video temporal understanding from a multimodal, multitask perspective and present a finer-grained challenge, which we call Multimodal News Excerpt Segmentation and Identification (MESI). The objective is to segment news videos into individual frame-level excerpts while assigning an elaborate tag to each segment using multimodal semantics. Since no multimodal fine-grained temporal segmentation dataset currently exists, we build a new benchmark, News-MESI, to support our research. News-MESI comprises over 150 high-quality news videos sourced from digital media, totaling approximately 150 hours and encompassing more than 2000 news excerpts. Annotated with frame-level excerpt boundaries and an elaborate categorization hierarchy, this collection offers a valuable opportunity for multimodal semantic understanding of these distinctive videos. We also present a novel algorithm that employs coarse-to-fine multimodal fusion and hierarchical classification to address this problem. Extensive experiments on our benchmark show how news content evolves temporally, and further analysis shows that multimodal solutions are significantly superior to single-modal ones.
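The task described above, segmenting a video into frame-level excerpts, implies a common post-processing step: collapsing a sequence of per-frame predictions into contiguous labeled segments. The following is a minimal illustrative sketch of that step only, not the paper's algorithm; the frame labels and class names here are invented for the example.

```python
# Hypothetical sketch: grouping per-frame class predictions into
# (start, end, label) excerpt segments with inclusive frame indices.
# This illustrates frame-level segmentation output in general; it is
# not code from the News-MESI paper.

def frames_to_segments(frame_labels):
    """Collapse a list of per-frame labels into contiguous segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment when the label changes or input ends.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start, i - 1, frame_labels[start]))
            start = i
    return segments

labels = ["studio", "studio", "interview", "interview", "interview", "studio"]
print(frames_to_segments(labels))
# → [(0, 1, 'studio'), (2, 4, 'interview'), (5, 5, 'studio')]
```

Frame-level boundary annotations, as provided by News-MESI, are exactly this kind of segment list; evaluating a model then reduces to comparing predicted segments against annotated ones.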
Journal description:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronics-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.