Qing Song;Zilong Jia;Wenhe Jia;Wenyi Zhao;Mengjie Hu;Chun Liu
Title: News-MESI: A Dataset for Multimodal News Excerpt Segmentation and Identification
DOI: 10.1109/TETCI.2024.3369866
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 4, pp. 3001-3016
Published: 2024-03-14
Citations: 0
Abstract
In long news videos, the fundamental component is the news excerpt, which consists of many studio and interview shots. Spotting and identifying the correct news excerpt within such a long, complex video is challenging: beyond the inherent temporal semantics and the complex interactions among generic events, the varied richness of semantics in the text and visual modalities further complicates matters. In this paper, we examine video temporal understanding from a multimodal, multitask perspective and present a finer-grained challenge, which we call Multimodal News Excerpt Segmentation and Identification (MESI). The objective is to segment news videos into individual frame-level excerpts while assigning an elaborate tag to each segment using multimodal semantics. Since no multimodal fine-grained temporal segmentation dataset currently exists, we build a new benchmark, News-MESI, to support our research. News-MESI comprises over 150 high-quality news videos sourced from digital media, totaling approximately 150 hours and encompassing more than 2000 news excerpts. Annotated with frame-level excerpt boundaries and an elaborate categorization hierarchy, this collection offers a valuable opportunity for multimodal semantic understanding of these distinctive videos. We also present a novel algorithm that employs coarse-to-fine multimodal fusion and hierarchical classification to address this problem. Extensive experiments on our benchmark show how news content evolves temporally, and further analysis shows that multimodal solutions are significantly superior to single-modal ones.
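The task described above, segmenting a video into frame-level excerpts, implies a common post-processing step: collapsing a sequence of per-frame predictions into contiguous labeled segments. The following is a minimal illustrative sketch of that step only, not the paper's algorithm; the frame labels and class names here are invented for the example.

```python
# Hypothetical sketch: grouping per-frame class predictions into
# (start, end, label) excerpt segments with inclusive frame indices.
# This illustrates frame-level segmentation output in general; it is
# not code from the News-MESI paper.

def frames_to_segments(frame_labels):
    """Collapse a list of per-frame labels into contiguous segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment when the label changes or input ends.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start, i - 1, frame_labels[start]))
            start = i
    return segments

labels = ["studio", "studio", "interview", "interview", "interview", "studio"]
print(frames_to_segments(labels))
# → [(0, 1, 'studio'), (2, 4, 'interview'), (5, 5, 'studio')]
```

Frame-level boundary annotations, as provided by News-MESI, are exactly this kind of segment list; evaluating a model then reduces to comparing predicted segments against annotated ones.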
Journal description:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronics-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.