Towards Long Form Audio-visual Video Understanding

IF 6 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Multimedia Computing Communications and Applications Pub Date : 2024-06-07 DOI:10.1145/3672079

Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu

{"title":"Towards Long Form Audio-visual Video Understanding","authors":"Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu","doi":"10.1145/3672079","DOIUrl":null,"url":null,"abstract":"<p>We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diversified modality-aware events, in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases in different levels: snippet prediction phase to learn snippet features, event extraction phase to extract event-level features, and event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"134 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3672079","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diversified modality-aware events, in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases in different levels: snippet prediction phase to learn snippet features, event extraction phase to extract event-level features, and event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.

查看原文本刊更多论文

实现对长篇视听视频的理解

我们生活的世界充满了永无止境的多模态信息流。作为对真实场景更自然的记录，长视频有望成为更好地探索和理解世界的重要桥梁。在本文中，我们提出了长视频中的多感官时间事件定位任务，并努力解决相关挑战。为了促进这项研究，我们首先收集了一个大规模的长格式视听视频（LFAV）数据集，其中包含 5,175 个视频，平均视频长度为 210 秒。每个收集到的视频都按照长时序精心注释了多种感知模态的事件。然后，我们提出了一个以事件为中心的框架，用于定位多感官事件并理解它们在长视频中的关系。它包括三个不同层次的阶段：学习片段特征的片段预测阶段、提取事件级特征的事件提取阶段以及研究事件关系的事件交互阶段。实验证明，利用新的 LFAV 数据集，所提出的方法在定位长视频中的多种模式感知事件方面表现出了相当高的效率。我们希望我们新收集的数据集和新颖的方法能成为进一步研究长视频视听理解领域的基石。项目页面：https://gewu-lab.github.io/LFAV/。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Multimedia Computing Communications and Applications 工程技术-计算机：理论方法

CiteScore

8.50

自引率

5.90%

发文量

285

审稿时长

7.5 months

期刊介绍： The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print form and digital form. The Journal is published quarterly; with roughly 7 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure a timely publication. The transactions consists primarily of research papers. This is an archival journal and it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.