Shichao Wu, Yongru Wang, Yushan Jiang, Qianyi Zhang, Jingtai Liu
{"title":"CRATI:基于对比表示的多模态声音事件定位和检测","authors":"Shichao Wu , Yongru Wang , Yushan Jiang , Qianyi Zhang , Jingtai Liu","doi":"10.1016/j.knosys.2024.112692","DOIUrl":null,"url":null,"abstract":"<div><div>Sound event localization and detection (SELD) refers to classifying sound categories and locating their locations with acoustic models on the same multichannel audio. Recently, SELD has been rapidly evolving by leveraging advanced approaches from other research areas, and the benchmark SELD datasets have become increasingly realistic with simultaneously captured videos provided. Vibration produces sound, we usually associate visual objects with their sound, i.e., we hear footsteps from a walking person, and hear a jangle from one running bell. It comes naturally to think about using multimodal information (image–audio–text vs audio merely), to strengthen sound event detection (SED) accuracies and decrease sound source localization (SSL) errors. In this paper, we propose one contrastive representation-based multimodal acoustic model (CRATI) for SELD, which is designed to learn contrastive audio representations from audio, text, and image in an end-to-end manner. Experiments on the real dataset of STARSS23 and the synthesized dataset of TAU-NIGENS Spatial Sound Events 2021 both show that our CRATI model can learn more effective audio features with additional constraints to minimize the difference among audio and text (SED and SSL annotations in this work). Image input is not conducive to improving SELD performance, as only minor visual changes can be observed from consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02<span><math><mo>°</mo></math></span> on the STARSS23 dataset, respectively.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"305 ","pages":"Article 112692"},"PeriodicalIF":7.2000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CRATI: Contrastive representation-based multimodal sound event localization and detection\",\"authors\":\"Shichao Wu , Yongru Wang , Yushan Jiang , Qianyi Zhang , Jingtai Liu\",\"doi\":\"10.1016/j.knosys.2024.112692\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Sound event localization and detection (SELD) refers to classifying sound categories and locating their locations with acoustic models on the same multichannel audio. Recently, SELD has been rapidly evolving by leveraging advanced approaches from other research areas, and the benchmark SELD datasets have become increasingly realistic with simultaneously captured videos provided. Vibration produces sound, we usually associate visual objects with their sound, i.e., we hear footsteps from a walking person, and hear a jangle from one running bell. It comes naturally to think about using multimodal information (image–audio–text vs audio merely), to strengthen sound event detection (SED) accuracies and decrease sound source localization (SSL) errors. In this paper, we propose one contrastive representation-based multimodal acoustic model (CRATI) for SELD, which is designed to learn contrastive audio representations from audio, text, and image in an end-to-end manner. 
Experiments on the real dataset of STARSS23 and the synthesized dataset of TAU-NIGENS Spatial Sound Events 2021 both show that our CRATI model can learn more effective audio features with additional constraints to minimize the difference among audio and text (SED and SSL annotations in this work). Image input is not conducive to improving SELD performance, as only minor visual changes can be observed from consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02<span><math><mo>°</mo></math></span> on the STARSS23 dataset, respectively.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"305 \",\"pages\":\"Article 112692\"},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2024-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705124013261\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124013261","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CRATI: Contrastive representation-based multimodal sound event localization and detection
Sound event localization and detection (SELD) refers to classifying sound categories and locating sound sources with acoustic models on the same multichannel audio. Recently, SELD has evolved rapidly by leveraging advanced approaches from other research areas, and benchmark SELD datasets have become increasingly realistic, now providing simultaneously captured video. Vibration produces sound, and we usually associate visual objects with the sounds they make: we hear footsteps from a walking person and a jangle from a ringing bell. It is therefore natural to use multimodal information (image–audio–text rather than audio alone) to improve sound event detection (SED) accuracy and reduce sound source localization (SSL) error. In this paper, we propose a contrastive representation-based multimodal acoustic model (CRATI) for SELD, designed to learn contrastive audio representations from audio, text, and images in an end-to-end manner. Experiments on the real STARSS23 dataset and the synthesized TAU-NIGENS Spatial Sound Events 2021 dataset both show that CRATI learns more effective audio features when additional constraints minimize the difference between audio and text (the SED and SSL annotations in this work). Image input is not conducive to improving SELD performance, as only minor visual changes can be observed across consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02° on the STARSS23 dataset.
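To make the core idea concrete, the sketch below shows the kind of contrastive audio–text alignment the abstract describes: audio embeddings are pulled toward the text embeddings of their paired annotations via a symmetric InfoNCE loss, CLIP-style. This is not the authors' implementation; the function name, embedding dimension, and temperature value are illustrative assumptions.

```python
# Minimal sketch of contrastive audio-text alignment (assumed InfoNCE form,
# not the authors' released code). Names and hyperparameters are hypothetical.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (audio, text) embeddings.

    audio_emb, text_emb: (batch, embed_dim) outputs of the respective encoders.
    Matching pairs lie on the diagonal of the similarity matrix.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)        # unit-length embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch) cosine sims
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Cross-entropy in both directions (audio-to-text and text-to-audio).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Random stand-ins for encoder outputs, e.g. from a multichannel audio
    # encoder and an encoder over SED/SSL annotation text.
    audio = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(contrastive_alignment_loss(audio, text).item())
```

In a full SELD pipeline this alignment term would be added to the usual detection and localization losses, so the audio encoder is jointly constrained by the annotation text while still being trained end-to-end.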
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.