CRATI: Contrastive representation-based multimodal sound event localization and detection

IF 7.2 | Zone 1, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems Pub Date : 2024-11-04 DOI:10.1016/j.knosys.2024.112692

Shichao Wu, Yongru Wang, Yushan Jiang, Qianyi Zhang, Jingtai Liu

Knowledge-Based Systems, volume 305, Article 112692. Journal article. Full text: https://www.sciencedirect.com/science/article/pii/S0950705124013261

Citations: 0

Abstract

Sound event localization and detection (SELD) refers to classifying sound events and estimating their positions with acoustic models operating on the same multichannel audio. SELD has recently evolved rapidly by borrowing advanced approaches from other research areas, and the benchmark SELD datasets have become increasingly realistic, now providing simultaneously captured video. Because sound is produced by vibration, we usually associate visual objects with their sounds: for example, we hear footsteps from a walking person and a jangle from a ringing bell. It is therefore natural to consider using multimodal information (image, audio, and text rather than audio alone) to improve sound event detection (SED) accuracy and reduce sound source localization (SSL) error. In this paper, we propose a contrastive representation-based multimodal acoustic model (CRATI) for SELD, which is designed to learn contrastive audio representations from audio, text, and image in an end-to-end manner. Experiments on the real-recording dataset STARSS23 and the synthesized dataset TAU-NIGENS Spatial Sound Events 2021 both show that the CRATI model learns more effective audio features when additional constraints minimize the difference between audio and text (the SED and SSL annotations in this work). Image input does not help SELD performance, because only minor visual changes can be observed across consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02° on the STARSS23 dataset.
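The abstract does not give the form of the constraint that pulls audio and text representations together, so the following is only a minimal sketch of one common choice: a symmetric, CLIP-style InfoNCE loss over a batch of paired embeddings. The function names, the temperature value, and the use of NumPy are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical audio-text contrastive loss (assumed form, not the paper's).

    Row i of audio_emb and row i of text_emb are treated as a matching pair;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs lie on the diagonal
    loss_audio_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))                       # stand-in annotation embeddings
    aligned_audio = text + 0.01 * rng.normal(size=(4, 8))
    mismatched_audio = np.roll(text, shift=1, axis=0)    # pairs deliberately broken
    print("aligned loss:   ", symmetric_contrastive_loss(aligned_audio, text))
    print("mismatched loss:", symmetric_contrastive_loss(mismatched_audio, text))
```

In this sketch the loss is small when each audio embedding is closest to its own text embedding and large when the pairing is broken, which is the behaviour a "minimize the difference between audio and text" constraint is meant to encourage.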


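Source journal

Knowledge-Based Systems (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months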

Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.