Self-Consistency Training with Hierarchical Temporal Aggregation for Sound Event Detection

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Pub Date: 2022-11-07. DOI: 10.23919/APSIPAASC55919.2022.9980285

Yunlong Li, Xiujuan Zhu, Mingyu Wang, Ying Hu


Citations: 0

Abstract



In this paper, we propose a sound event detection (SED) method, named SCT-HTA, based on a self-consistency training (SCT) strategy and a hierarchical temporal aggregation (HTA) module. The method adopts the Mean Teacher (MT) semi-supervised learning framework and uses a dual-branch convolutional recurrent neural network (CRNN) structure consisting of a main branch and an auxiliary branch. The SCT strategy applies a self-consistency regularization, in addition to the MT loss, to maintain consistency between the outputs of the auxiliary and main branches. Furthermore, the HTA module is designed to aggregate information at different temporal resolutions. We also explore three aggregators for the HTA module and four combinations of pooling methods in the localization modules of the two branches. Experimental results demonstrate that the proposed SCT-HTA method outperforms the four compared methods. The results also show that the max pooling aggregator is better at highlighting the location of sound events, and that the “linear softmax + attention” combination of pooling methods achieves the best performance.
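The abstract names its building blocks without giving formulas, so the following is only a minimal sketch of the standard forms these terms usually take, not the authors' implementation. It assumes the common definitions of linear softmax pooling and attention pooling over frame-level probabilities, a mean-squared-error consistency term between the two branches (the paper's actual regularizer is not stated here), and the usual Mean Teacher exponential moving average of weights. All function names and the `decay` value are hypothetical, and the abstract does not say which branch uses which pooling method.

```python
import math


def linear_softmax_pool(frame_probs):
    """Clip-level probability in which each frame is weighted by its own probability."""
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0  # no frame is active, so the clip-level probability is zero
    return sum(p * p for p in frame_probs) / total


def attention_pool(frame_probs, attention_logits):
    """Clip-level probability as a softmax-weighted average over frames.

    attention_logits stands in for the output of a learned attention layer
    (assumed here; the paper's exact layer is not described in the abstract).
    """
    peak = max(attention_logits)  # subtract the maximum for numerical stability
    weights = [math.exp(a - peak) for a in attention_logits]
    norm = sum(weights)
    return sum(w * p for w, p in zip(weights, frame_probs)) / norm


def consistency_mse(main_probs, aux_probs):
    """Assumed consistency term: mean squared difference between branch outputs."""
    squared = [(m - a) ** 2 for m, a in zip(main_probs, aux_probs)]
    return sum(squared) / len(squared)


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Mean Teacher update: teacher weights track an exponential moving average."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


if __name__ == "__main__":
    frames = [0.1, 0.9, 0.8, 0.05]
    print(linear_softmax_pool(frames))
    print(attention_pool(frames, [0.0, 2.0, 1.5, -1.0]))
    print(consistency_mse(frames, [0.2, 0.7, 0.8, 0.1]))
    print(ema_update([1.0, -1.0], [0.0, 0.0], decay=0.9))
```

Linear softmax pooling lets confident frames dominate the clip-level score, whereas attention pooling learns the frame weights separately from the frame probabilities; that difference is one plausible reason for pairing the two across branches.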
