Sound Event Detection in Domestic Environment Using Frequency-Dynamic Convolution and Local Attention

IF 2.9 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information (Switzerland), Pub Date: 2023-09-30, DOI: 10.3390/info14100534

Grigorios-Aris Cheimariotis, Nikolaos Mitianoudis


Citations: 0

Abstract

This work describes a methodology for sound event detection in domestic environments. Efficient solutions to this task can support the autonomous living of the elderly. The methodology addresses the 2023 edition of the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)”, and more specifically Task 4a, “Sound event detection of domestic activities”. This task involves detecting 10 common domestic sound events in 10 s sound clips; an event may have arbitrary duration within the clip. The main components of the methodology are data augmentation on the mel-spectrograms that represent the sound clips, feature extraction by passing the spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and a BiGRU for sequence modeling. In addition, a mean teacher model is employed to leverage unlabeled data. This research focuses on the effect of data augmentation techniques, of the feature extraction models, and of self-supervised learning. The main contribution is the proposed feature extraction model, which applies weighted attention over frequency in each convolution, combined in sequence with a local attention module adopted from computer vision. The proposed system shows promising and robust performance.
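The mean teacher component mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no hyperparameters or loss definition, so everything below follows the commonly used mean teacher formulation rather than the authors' implementation: the teacher's weights are an exponential moving average (EMA) of the student's weights, and a consistency loss penalizes disagreement between the two models' predictions on unlabeled clips. The names `ema_update` and `consistency_loss`, the decay `alpha`, and the toy scalar weights are all assumptions made for illustration.

```python
def ema_update(teacher, student, alpha=0.999):
    """Move each teacher weight toward the matching student weight (in place).

    alpha is an assumed EMA decay; the abstract does not state a value.
    """
    for name, s_value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * s_value
    return teacher


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher predictions.

    MSE is a common choice for the consistency term; it is assumed here.
    """
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Toy weights: hypothetical scalars standing in for network tensors.
student = {"w": 1.0, "b": 0.0}
teacher = {"w": 0.0, "b": 0.0}

# One EMA step with a deliberately small decay so the change is visible.
ema_update(teacher, student, alpha=0.9)

# Disagreement on a hypothetical unlabeled clip with two event classes.
loss = consistency_loss([0.5, 1.0], [0.0, 1.0])
```

In a training loop of this kind, only the student would be updated by gradient descent, on the supervised loss plus the consistency term, and `ema_update` would be called after each optimizer step so that the teacher tracks a smoothed version of the student.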


