Multi-Feature Convergence Network for Acoustic Scene Classification

Menglong Wu, Hongxia Dong, Xichang Cai, Ziling Qiao, Cuizhu Qin, Lin Zhang
Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering
Published: 2022-10-21
DOI: 10.1145/3573428.3573633 (https://doi.org/10.1145/3573428.3573633)
Citations: 0

Abstract

This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). Neural network models that take the log-Mel spectrogram, its deltas, and its delta-deltas stacked along the channel axis as input have achieved good classification results. However, the low-frequency part of the spectrogram extracted from the audio signal has low resolution and looks mosaic-like, so the log-Mel, delta, and delta-delta feature loses information at low frequencies, which reduces classification accuracy. To solve this problem, the constant-Q transform (CQT) spectrogram is introduced and stacked along the channel axis with the log-Mel, delta, and delta-delta feature, forming a 4-channel feature that serves as the input to the network. In addition, the network is deepened from the 8 residual blocks of the baseline system to 10 residual blocks, and a snapshot ensemble is formed from the models saved during training, since they carry complementary information. Finally, a 3-class classifier is added alongside the 10-class classifier for the primary scene categories, and the final scene is chosen by combining the scores of the two stages. The proposed network reaches a classification accuracy of 77.4%, which is 5.1% higher than the baseline system set up in this paper and 26% higher than the official DCASE 2020 baseline.
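The two-stage scoring mentioned in the abstract can be illustrated with a small sketch. The abstract does not state how the 3-class and 10-class scores are combined, so the rule below (multiply each scene score by the score of its coarse group, then renormalize) is only an assumption, and the function names `fuse_two_stage` and `predict` are hypothetical. The scene-to-group mapping follows the DCASE 2020 Task 1 class definitions (indoor, outdoor, transportation); the paper is assumed, not confirmed, to use the same grouping.

```python
# Minimal sketch of two-stage (3-class + 10-class) score fusion.
# ASSUMPTION: product-then-renormalize fusion; the paper's actual rule is not given.

# Coarse group of each DCASE 2020 Task 1 scene class.
SCENE_TO_GROUP = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "park": "outdoor",
    "public_square": "outdoor",
    "street_pedestrian": "outdoor",
    "street_traffic": "outdoor",
    "bus": "transportation",
    "metro": "transportation",
    "tram": "transportation",
}


def fuse_two_stage(scene_probs, group_probs):
    """Weight each 10-class score by its parent 3-class score and renormalize."""
    fused = {
        scene: prob * group_probs[SCENE_TO_GROUP[scene]]
        for scene, prob in scene_probs.items()
    }
    total = sum(fused.values())
    if total == 0:
        raise ValueError("all fused scores are zero")
    return {scene: value / total for scene, value in fused.items()}


def predict(scene_probs, group_probs):
    """Return the scene label with the highest fused score."""
    fused = fuse_two_stage(scene_probs, group_probs)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Made-up scores for illustration: the 10-class model slightly prefers
    # "metro", but the 3-class model is fairly sure the clip is indoor.
    scene_probs = {scene: 0.04375 for scene in SCENE_TO_GROUP}
    scene_probs["metro"] = 0.35
    scene_probs["metro_station"] = 0.30
    group_probs = {"indoor": 0.7, "outdoor": 0.1, "transportation": 0.2}
    print(predict(scene_probs, group_probs))  # metro_station
```

In this made-up example the coarse classifier overrides a narrow 10-class preference for a scene from a different group, which is the kind of confusion a two-stage scheme is meant to reduce.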