Self-Consistency Training with Hierarchical Temporal Aggregation for Sound Event Detection
Yunlong Li, Xiujuan Zhu, Mingyu Wang, Ying Hu
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022-11-07
DOI: 10.23919/APSIPAASC55919.2022.9980285
We propose SCT-HTA, a sound event detection (SED) method that combines a self-consistency training (SCT) strategy with a hierarchical temporal aggregation (HTA) module. The method adopts the Mean Teacher (MT) semi-supervised learning framework and uses a dual-branch convolutional recurrent neural network (CRNN) consisting of a main branch and an auxiliary branch. The SCT strategy adds a self-consistency regularization term to the MT loss to keep the outputs of the auxiliary and main branches consistent. The HTA module aggregates information at different temporal resolutions. We also explore three aggregators for the HTA module and four combinations of pooling methods for the localization modules of the two branches. Experimental results show that SCT-HTA outperforms the four methods it is compared against, that the max pooling aggregator is better at highlighting the location of sound events, and that the “linear softmax + attention” pooling combination achieves the best performance.
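The abstract names its building blocks without giving formulas, so the following is only a minimal sketch of how they are commonly defined in the SED literature, not the authors' implementation. It assumes linear softmax pooling is sum(p^2) / sum(p), attention pooling is a softmax-weighted mean of frame probabilities, the max pooling aggregator takes the maximum over non-overlapping windows, the self-consistency term is a mean squared error between branch outputs, and the Mean Teacher weights follow an exponential moving average. All function names, the window size, and the decay value are hypothetical.

```python
import math


def linear_softmax_pool(frame_probs):
    """Clip-level probability as sum(p^2) / sum(p) (assumed definition)."""
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0  # no frame is active, so the clip probability is zero
    return sum(p * p for p in frame_probs) / total


def attention_pool(frame_probs, attn_logits):
    """Clip-level probability as a softmax(attn_logits)-weighted mean."""
    peak = max(attn_logits)  # subtract the max for numerical stability
    weights = [math.exp(a - peak) for a in attn_logits]
    norm = sum(weights)
    return sum(w * p for w, p in zip(weights, frame_probs)) / norm


def max_aggregate(frame_values, window):
    """Reduce temporal resolution by taking the max over non-overlapping windows."""
    return [max(frame_values[i:i + window])
            for i in range(0, len(frame_values), window)]


def consistency_mse(main_out, aux_out):
    """Mean squared error between main-branch and auxiliary-branch outputs
    (assumed form of the self-consistency regularizer)."""
    return sum((m - a) ** 2 for m, a in zip(main_out, aux_out)) / len(main_out)


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Mean Teacher update: teacher weights track an EMA of student weights."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.3, 0.2]          # toy frame-level probabilities
    logits = [0.0, 2.0, 0.5, 0.0]         # toy attention scores
    print(linear_softmax_pool(probs))     # clip-level score, linear softmax
    print(attention_pool(probs, logits))  # clip-level score, attention
    print(max_aggregate(probs, 2))        # coarser temporal resolution
    print(consistency_mse(probs, [0.2, 0.8, 0.3, 0.1]))
```

In a dual-branch setup, one branch's localization module could use `linear_softmax_pool` and the other `attention_pool`. The abstract does not state which branch uses which pooling method, so that assignment is left open here.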