A Joint Detection-Classification Model for Weakly Supervised Sound Event Detection Using Multi-Scale Attention Method

Yaoguang Wang, Liang He
2020 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), December 9, 2020
DOI: 10.1109/ISSPIT51521.2020.9408948
Attention mechanisms have been applied to weakly supervised sound event detection (SED) and have achieved state-of-the-art performance, but most methods attend only along the time axis. In this paper, we propose the multi-scale time-frequency attention (MTFA) method to capture intrinsic features at different scales in both the time and frequency domains for audio tagging (AT) and SED. Our model is a unified network that performs AT and SED simultaneously: the MTFA module produces multi-scale attention-aware representations for SED, and a global pooling module maps these representations to the presence probability of each audio event for AT. To evaluate the proposed method, we conduct experiments on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. On the evaluation set, our model achieves an F1-score of 57.9% on the AT task and an error rate of 0.71 on the SED task, which is comparable to the state-of-the-art results in the challenge.
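The abstract's joint detection-classification scheme (frame-level SED outputs pooled into clip-level AT probabilities via attention over time) can be illustrated with a minimal sketch. This is not the authors' MTFA implementation; the function names, shapes, and random inputs are hypothetical, and only the generic attention-pooling step common to weakly supervised SED models is shown.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pooling(frame_probs, attn_logits):
    """Aggregate frame-level event probabilities into clip-level ones.

    frame_probs : (T, C) per-frame presence probabilities (the SED output)
    attn_logits : (T, C) unnormalized attention scores along the time axis
    returns     : (C,)  clip-level presence probabilities (the AT output)
    """
    weights = softmax(attn_logits, axis=0)      # sums to 1 over the T frames
    return (weights * frame_probs).sum(axis=0)  # attention-weighted average

# Hypothetical example: 8 frames, 3 event classes.
rng = np.random.default_rng(0)
T, C = 8, 3
frame_probs = rng.random((T, C))          # stand-in for a detector's output
attn_logits = rng.normal(size=(T, C))     # stand-in for learned attention
clip_probs = attention_pooling(frame_probs, attn_logits)
print(clip_probs.shape)  # (3,)
```

Because the attention weights form a convex combination over frames, each clip-level probability stays within the range of that class's frame-level probabilities, which is what lets a single network be trained on weak (clip-level) labels while still emitting frame-level detections.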