News-MESI: A Dataset for Multimodal News Excerpt Segmentation and Identification

IF 5.3 · Zone 3, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Qing Song;Zilong Jia;Wenhe Jia;Wenyi Zhao;Mengjie Hu;Chun Liu
IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 4, pp. 3001-3016
Publication date: 2024-03-14
Publication type: Journal Article
DOI: 10.1109/TETCI.2024.3369866
URL: https://ieeexplore.ieee.org/document/10464379/
Citations: 0

Abstract

News-MESI: A Dataset for Multimodal News Excerpt Segmentation and Identification
In complex, long news videos, the fundamental unit is the news excerpt, which consists of many studio and interview screens. Spotting and identifying the correct news excerpt in such a video is challenging. Beyond the inherent temporal semantics and the complex interactions among generic events, the varying richness of semantics in the text and visual modalities further complicates the task. In this paper, we study video temporal understanding from a multimodal, multitask perspective. We pose a more fine-grained challenge, which we call Multimodal News Excerpt Segmentation and Identification (MESI). The objective is to segment news videos into individual excerpts at frame level while accurately assigning detailed tags to each segment using multimodal semantics. Because no multimodal, fine-grained temporal segmentation dataset currently exists, we build a new benchmark, News-MESI, to support our research. News-MESI comprises over 150 high-quality news videos sourced from digital media, totaling approximately 150 hours and containing more than 2000 news excerpts. Annotated with frame-level excerpt boundaries and a detailed categorization hierarchy, the collection supports multimodal semantic understanding of these distinctive videos. We also present a novel algorithm that uses coarse-to-fine multimodal fusion and hierarchical classification to address this problem. Extensive experiments on our benchmark show how news content evolves over time. Further analysis shows that multimodal solutions are significantly superior to the single-modal solution.
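To make the task definition concrete, here is a minimal sketch of how frame-level excerpt boundaries and a hierarchical tag might be represented and expanded into per-frame labels. The class name `Excerpt`, its fields, the two-level `(coarse, fine)` tag tuple, and the example tags are all assumptions made for illustration; the abstract does not describe the dataset's actual annotation schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Excerpt:
    """Hypothetical annotation record: one news excerpt in one video."""
    start_frame: int          # first frame of the excerpt (inclusive)
    end_frame: int            # last frame of the excerpt (inclusive)
    tags: Tuple[str, str]     # assumed (coarse, fine) path in a tag hierarchy


def frame_labels(excerpts: List[Excerpt], num_frames: int) -> List[Optional[int]]:
    """Expand excerpt boundaries into one excerpt index per frame.

    Frames not covered by any excerpt are labeled None.
    """
    labels: List[Optional[int]] = [None] * num_frames
    for idx, ex in enumerate(excerpts):
        if not (0 <= ex.start_frame <= ex.end_frame < num_frames):
            raise ValueError(f"excerpt {idx} lies outside the video")
        for f in range(ex.start_frame, ex.end_frame + 1):
            if labels[f] is not None:
                raise ValueError(f"excerpt {idx} overlaps excerpt {labels[f]}")
            labels[f] = idx
    return labels


def boundaries(labels: List[Optional[int]]) -> List[int]:
    """Return the frame indices at which the excerpt label changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Made-up example: a 10-frame clip containing two excerpts.
example = [
    Excerpt(0, 3, ("politics", "election")),
    Excerpt(4, 9, ("sports", "football")),
]
example_labels = frame_labels(example, num_frames=10)
example_boundaries = boundaries(example_labels)
```

In this representation, segmentation amounts to predicting the boundary frames, and identification amounts to predicting the tag path for each resulting segment.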
Source journal
CiteScore: 10.30
Self-citation rate: 7.50%
Articles published: 147
About the journal: The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronic-only publication with six issues per year. Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.