基于深度视听融合网络的稳健说话人跟踪

IF 8.4 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2024-12-24 DOI:10.1109/TMM.2024.3521737

Yidi Li;Hong Liu;Bing Yang

{"title":"基于深度视听融合网络的稳健说话人跟踪","authors":"Yidi Li;Hong Liu;Bing Yang","doi":"10.1109/TMM.2024.3521737","DOIUrl":null,"url":null,"abstract":"Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals hasn't been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features is further interacted in the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1835-1847"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"STNet: Deep Audio–Visual Fusion Network for Robust Speaker Tracking\",\"authors\":\"Yidi Li;Hong Liu;Bing Yang\",\"doi\":\"10.1109/TMM.2024.3521737\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals hasn't been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features is further interacted in the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"1835-1847\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10814658/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814658/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

视听说话人跟踪的目的是利用多传感器平台捕获的信号确定场景中人类目标的位置，多模态融合方法可以提高该平台的精度和鲁棒性。最近，人们提出了几种融合方法来模拟多模态的相关性。然而，对于说话人跟踪问题，音频和视觉信号之间的跨模态交互作用尚未得到很好的利用。为此，本文提出了一种新的基于深度视听融合模型的说话人跟踪网络（STNet）。我们设计了一种视觉引导的声学测量方法，在统一的定位空间中融合异构线索，通过相机模型的视觉观测来构建增强的声学地图。在特征融合方面，采用跨模态注意模块对多模态上下文和交互进行联合建模。在融合模型中，音频和视觉特征之间的相关信息进一步交互。此外，基于stnet的跟踪器通过质量感知模块应用于多说话人的情况，该模块评估多模态观测的可靠性，以实现复杂场景下的鲁棒跟踪。在AV16.3和CAV3D数据集上的实验表明，基于stnet的跟踪器优于单模态方法和最先进的视听说话人跟踪器。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

STNet: Deep Audio–Visual Fusion Network for Robust Speaker Tracking

Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals hasn't been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features is further interacted in the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.