{"title":"重叠声音事件检测的特征级注意力池","authors":"Mei Wang, Zhengyi An, Yuh-Lin Yao, Xiyu Song","doi":"10.1109/NaNA56854.2022.00098","DOIUrl":null,"url":null,"abstract":"The aim of overlapping sound event detection (SED) is to detect the sound events classes contained and the corresponding time stamps in audio. In real life, the sound events contained in a piece of audio are usually very complicated, and various sound events overlap and alternate, which greatly increases the difficulty of overlapping SED. In our past research, we found that the attention mechanism can effectively cope with the mentioned difficulty, as long as its need for reinforcement training on a large amount of strongly labeled data is satisfied. However, strongly labeled data is usually not easy to obtain, so how to use weakly labeled data has become mainstream. Against this background, we propose a new overlapping SED method that can effectively utilize weakly labeled data through the use of the feature-level attention pooling strategy under the multiple-instance learning (MIL) framework. Moreover, a feature-level attention pooling strategy with different learnable parameters is studied and compared with a decision-level attention pooling strategy in our work, so the advantages of attention with a feature-level attention pooling strategy in overlapping SED can be better shown. We compare the classification results of overlapping SED using feature-level and decision-level attention pooling in the DCASE 2021 task 4 scenario (Sound Event Detection and Separation in Domestic Environments). Test results show that the PSDS1 (focus on the sensitiveness to the detection speed) of feature-level attention pooling is kept unchanged (it means the learnable parameters training does not burden the overlapping SED), but the PSDS2 (focus on the classification effect) is improved by 3%, which is slightly above the score generated from the decision-level attention pooling (it means feature-level attention pooling has better returns). Although the results are only slightly improved when compared to the decision-level attention pooling, it has somehow improved research guidance for the research of improving the classification results of overlapping SED. Our conclusion is that feature-level attention pooling is good for exploiting weak-label data, and can achieve better classification without increasing computational complexity.","PeriodicalId":113743,"journal":{"name":"2022 International Conference on Networking and Network Applications (NaNA)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Feature-level Attention Pooling for Overlapping Sound Event Detection\",\"authors\":\"Mei Wang, Zhengyi An, Yuh-Lin Yao, Xiyu Song\",\"doi\":\"10.1109/NaNA56854.2022.00098\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The aim of overlapping sound event detection (SED) is to detect the sound events classes contained and the corresponding time stamps in audio. In real life, the sound events contained in a piece of audio are usually very complicated, and various sound events overlap and alternate, which greatly increases the difficulty of overlapping SED. In our past research, we found that the attention mechanism can effectively cope with the mentioned difficulty, as long as its need for reinforcement training on a large amount of strongly labeled data is satisfied. 
However, strongly labeled data is usually not easy to obtain, so how to use weakly labeled data has become mainstream. Against this background, we propose a new overlapping SED method that can effectively utilize weakly labeled data through the use of the feature-level attention pooling strategy under the multiple-instance learning (MIL) framework. Moreover, a feature-level attention pooling strategy with different learnable parameters is studied and compared with a decision-level attention pooling strategy in our work, so the advantages of attention with a feature-level attention pooling strategy in overlapping SED can be better shown. We compare the classification results of overlapping SED using feature-level and decision-level attention pooling in the DCASE 2021 task 4 scenario (Sound Event Detection and Separation in Domestic Environments). Test results show that the PSDS1 (focus on the sensitiveness to the detection speed) of feature-level attention pooling is kept unchanged (it means the learnable parameters training does not burden the overlapping SED), but the PSDS2 (focus on the classification effect) is improved by 3%, which is slightly above the score generated from the decision-level attention pooling (it means feature-level attention pooling has better returns). Although the results are only slightly improved when compared to the decision-level attention pooling, it has somehow improved research guidance for the research of improving the classification results of overlapping SED. Our conclusion is that feature-level attention pooling is good for exploiting weak-label data, and can achieve better classification without increasing computational complexity.\",\"PeriodicalId\":113743,\"journal\":{\"name\":\"2022 International Conference on Networking and Network Applications (NaNA)\",\"volume\":\"134 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Networking and Network Applications (NaNA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NaNA56854.2022.00098\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Networking and Network Applications (NaNA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NaNA56854.2022.00098","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The aim of overlapping sound event detection (SED) is to detect the classes of sound events contained in an audio clip, together with their corresponding time stamps. In real-life audio, the sound events are usually very complicated, with various events overlapping and alternating, which greatly increases the difficulty of overlapping SED. In our past research, we found that the attention mechanism can cope with this difficulty effectively, provided that it is trained on a large amount of strongly labeled data. However, strongly labeled data is usually hard to obtain, so exploiting weakly labeled data has become the mainstream approach. Against this background, we propose a new overlapping SED method that effectively utilizes weakly labeled data through a feature-level attention pooling strategy under the multiple-instance learning (MIL) framework. Moreover, we study feature-level attention pooling with different learnable parameters and compare it with decision-level attention pooling, so that the advantages of the feature-level strategy in overlapping SED can be shown more clearly. We compare the classification results of overlapping SED using feature-level and decision-level attention pooling in the DCASE 2021 Task 4 scenario (Sound Event Detection and Separation in Domestic Environments). Test results show that the PSDS1 score of feature-level attention pooling (which emphasizes sensitivity to detection speed) remains unchanged, meaning that training the additional learnable parameters does not burden the overlapping SED system, while the PSDS2 score (which emphasizes classification quality) improves by 3%, slightly above the score obtained with decision-level attention pooling, meaning that feature-level attention pooling yields better returns. Although the improvement over decision-level attention pooling is modest, it offers useful guidance for research on improving the classification results of overlapping SED. Our conclusion is that feature-level attention pooling is well suited to exploiting weakly labeled data and can achieve better classification without increasing computational complexity.
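To make the contrast between the two pooling strategies concrete, the following is a minimal PyTorch sketch, not the authors' implementation: decision-level attention pooling aggregates frame-wise class probabilities into a clip-level prediction, while feature-level attention pooling first aggregates frame-wise features and then classifies the pooled embedding. Module names, layer sizes, and tensor shapes are assumptions chosen for illustration under the MIL view of weakly labeled SED.

```python
# Illustrative sketch (assumed architecture, not the paper's released code):
# decision-level vs. feature-level attention pooling over frame-wise features,
# e.g. the output of a CRNN encoder applied to a 10 s clip.
import torch
import torch.nn as nn


class DecisionLevelAttentionPooling(nn.Module):
    """Pool frame-level class probabilities into a clip-level prediction."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # frame-wise class scores
        self.attention = nn.Linear(feat_dim, num_classes)   # frame-wise attention scores per class

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        frame_probs = torch.sigmoid(self.classifier(frames))      # (B, T, C)
        weights = torch.softmax(self.attention(frames), dim=1)    # normalize over time
        clip_probs = (weights * frame_probs).sum(dim=1)           # weighted vote per class
        return clip_probs                                         # (B, C), matched to weak labels


class FeatureLevelAttentionPooling(nn.Module):
    """Pool frame-level features first, then classify the pooled embedding."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.attention = nn.Linear(feat_dim, feat_dim)   # learnable per-dimension attention
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.attention(frames), dim=1)        # attention over time, per feature dim
        clip_embedding = (weights * frames).sum(dim=1)                # (B, feat_dim) pooled feature
        clip_probs = torch.sigmoid(self.classifier(clip_embedding))   # (B, C) clip-level prediction
        return clip_probs


if __name__ == "__main__":
    # Assumed shapes: 10 classes as in the DCASE domestic-environment setup.
    batch, time, feat_dim, num_classes = 4, 156, 128, 10
    frames = torch.randn(batch, time, feat_dim)
    print(DecisionLevelAttentionPooling(feat_dim, num_classes)(frames).shape)  # torch.Size([4, 10])
    print(FeatureLevelAttentionPooling(feat_dim, num_classes)(frames).shape)   # torch.Size([4, 10])
```

In either case, the clip-level output can be trained against weak (clip-level) labels with a binary cross-entropy loss, which is how the MIL framework lets weakly labeled data supervise a frame-resolved model.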