AudioMask: Robust Sound Event Detection Using Mask R-CNN and Frame-Level Classifier

Alireza Nasiri, Yuxin Cui, Zhonghao Liu, Jing Jin, Yong Zhao, Jianjun Hu
DOI: 10.1109/ICTAI.2019.00074
Published in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), November 2019
Citations: 2

Abstract

Deep learning methods have recently made significant contributions to sound event detection. These methods either use a block-level approach to distinguish the parts of the audio containing the event, or analyze small frames of the audio separately. In this paper, we introduce a new method, AudioMask, for rare sound event detection that combines these two approaches. AudioMask first applies Mask R-CNN, a state-of-the-art algorithm for detecting objects in images, to the log mel-spectrogram of the audio files. Mask R-CNN detects audio segments that might contain the target event by generating bounding boxes around them in the time-frequency domain. We then use a frame-based audio event classifier, trained independently of Mask R-CNN, to analyze each individual frame in the candidate segments proposed by Mask R-CNN. A post-processing step combines the outputs of Mask R-CNN and the frame-level classifier to identify the true events. By evaluating AudioMask on the data sets from the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge Task 2, we show that our algorithm outperforms the baseline models by 13.3% in average F-score and achieves better results than the other non-ensemble methods in the challenge.
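The abstract's pipeline starts by converting each audio file into a log mel-spectrogram, the image-like representation that Mask R-CNN operates on. The paper's exact front-end parameters (FFT size, hop length, number of mel bands) are not given in the abstract, so the sketch below uses common default values; it is a minimal numpy-only illustration of the preprocessing step, not the authors' implementation.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=44100, n_fft=1024, hop=512, n_mels=40):
    """Return a (frames x mel bands) log mel-spectrogram in dB.

    Parameter values are illustrative defaults, not the paper's settings.
    """
    # Slice the signal into overlapping Hann-windowed frames.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filterbank spanning 0 .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    # Apply the filterbank, then convert power to decibels.
    mel_power = power @ fbank.T
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))

# Example: one second of noise at 44.1 kHz yields an 85-frame, 40-band spectrogram.
sig = np.random.default_rng(0).standard_normal(44100)
S = log_mel_spectrogram(sig)
```

The resulting 2-D array can be treated as a single-channel image, which is what allows an object detector such as Mask R-CNN to propose time-frequency bounding boxes around candidate events.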