AudioMask: Robust Sound Event Detection Using Mask R-CNN and Frame-Level Classifier

Alireza Nasiri, Yuxin Cui, Zhonghao Liu, Jing Jin, Yong Zhao, Jianjun Hu
DOI: 10.1109/ICTAI.2019.00074
Published in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), November 2019
Citations: 2

Abstract

Deep learning methods have recently made significant contributions to sound event detection. These methods either use a block-level approach to distinguish the parts of the audio containing the event, or analyze small frames of the audio separately. In this paper, we introduce a new method, AudioMask, for rare sound event detection that combines these two approaches. AudioMask first applies Mask R-CNN, a state-of-the-art algorithm for detecting objects in images, to the log mel-spectrogram of the audio files. Mask R-CNN detects audio segments that might contain the target event by generating bounding boxes around them in the time-frequency domain. We then use a frame-based audio event classifier, trained independently of Mask R-CNN, to analyze each individual frame in the candidate segments proposed by Mask R-CNN. A post-processing step combines the outputs of Mask R-CNN and the frame-level classifier to identify the true events. By evaluating AudioMask on the data sets from the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge Task 2, we show that our algorithm outperforms the baseline models by 13.3% in average F-score and achieves better results than the other non-ensemble methods in the challenge.
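The abstract's pipeline starts by converting each audio file into a log mel-spectrogram, the image-like representation that Mask R-CNN operates on. The paper's exact front-end parameters (FFT size, hop length, number of mel bands) are not given in the abstract, so the sketch below uses common default values; it is a minimal numpy-only illustration of the preprocessing step, not the authors' implementation.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=44100, n_fft=1024, hop=512, n_mels=40):
    """Return a (frames x mel bands) log mel-spectrogram in dB.

    Parameter values are illustrative defaults, not the paper's settings.
    """
    # Slice the signal into overlapping Hann-windowed frames.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filterbank spanning 0 .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    # Apply the filterbank, then convert power to decibels.
    mel_power = power @ fbank.T
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))

# Example: one second of noise at 44.1 kHz yields an 85-frame, 40-band spectrogram.
sig = np.random.default_rng(0).standard_normal(44100)
S = log_mel_spectrogram(sig)
```

The resulting 2-D array can be treated as a single-channel image, which is what allows an object detector such as Mask R-CNN to propose time-frequency bounding boxes around candidate events.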