{"title":"声学场景分类的多特征收敛网络","authors":"Menglong Wu, Hongxia Dong, Xichang Cai, Ziling Qiao, Cuizhu Qin, Lin Zhang","doi":"10.1145/3573428.3573633","DOIUrl":null,"url":null,"abstract":"This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). A series of neural network models designed with features of the Log Mel spectrogram, Deltas, and Delta-Deltas superimposed on the channel have achieved good classification results. However, the low-frequency part of the speech spectrogram feature extracted from the audio signal has a mosaic shape due to its low resolution, which leads to the loss of information in the low-frequency part of the Log Mel-Deltas-DeltaDeltas feature and reduces the classification accuracy. To solve this problem, the constant Q-transform (CQT) spectrogram is introduced and this feature is superimposed on the channel with the log Mel-Deltas-DeltaDeltas feature to form a 4-channel feature spectrum as the input to the network model. Moreover, the proposed network model is deepened by increasing the 8 residual blocks from the baseline system to 10 residual blocks and a snapshot integration operation is performed on the various models saved during the training process due to the complementary information. And then, a 3-classifier is added based on the ASC's primarily categorized scenes' 10-classifier and chooses the final scene classification by combining the 3–10 two-stage classification scores. The classification accuracy of our proposed network reached 77.4%, which is 5.1% higher than the baseline system set in this paper and 26% higher than the baseline on the official website of DCASE 2020.","PeriodicalId":314698,"journal":{"name":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Feature Convergence Network for Acoustic Scene Classification\",\"authors\":\"Menglong Wu, Hongxia Dong, Xichang Cai, Ziling Qiao, Cuizhu Qin, Lin Zhang\",\"doi\":\"10.1145/3573428.3573633\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). A series of neural network models designed with features of the Log Mel spectrogram, Deltas, and Delta-Deltas superimposed on the channel have achieved good classification results. However, the low-frequency part of the speech spectrogram feature extracted from the audio signal has a mosaic shape due to its low resolution, which leads to the loss of information in the low-frequency part of the Log Mel-Deltas-DeltaDeltas feature and reduces the classification accuracy. To solve this problem, the constant Q-transform (CQT) spectrogram is introduced and this feature is superimposed on the channel with the log Mel-Deltas-DeltaDeltas feature to form a 4-channel feature spectrum as the input to the network model. Moreover, the proposed network model is deepened by increasing the 8 residual blocks from the baseline system to 10 residual blocks and a snapshot integration operation is performed on the various models saved during the training process due to the complementary information. 
And then, a 3-classifier is added based on the ASC's primarily categorized scenes' 10-classifier and chooses the final scene classification by combining the 3–10 two-stage classification scores. The classification accuracy of our proposed network reached 77.4%, which is 5.1% higher than the baseline system set in this paper and 26% higher than the baseline on the official website of DCASE 2020.\",\"PeriodicalId\":314698,\"journal\":{\"name\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3573428.3573633\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573428.3573633","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-Feature Convergence Network for Acoustic Scene Classification
This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). Neural network models that take the log-Mel spectrogram together with its delta and delta-delta features, stacked along the channel dimension, have achieved good classification results. However, the low-frequency part of the spectrogram extracted from the audio signal has a mosaic-like appearance because of its low resolution, so information in the low-frequency part of the log-Mel/deltas/delta-deltas feature is lost and classification accuracy drops. To address this, the constant-Q transform (CQT) spectrogram is introduced and stacked along the channel dimension with the log-Mel/deltas/delta-deltas feature, forming a 4-channel feature map that serves as the input to the network model. Moreover, the network is deepened from the baseline system's 8 residual blocks to 10, and snapshot ensembling is applied to the models saved at different points during training to exploit their complementary information. A 3-class classifier over the coarse scene categories is then added alongside the 10-class scene classifier, and the final scene label is chosen by combining the scores of the two-stage 3-class/10-class classification. The classification accuracy of the proposed network reaches 77.4%, which is 5.1 percentage points higher than the baseline system used in this paper and 26 percentage points higher than the official DCASE 2020 baseline.
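To make the 4-channel input concrete, below is a minimal sketch of how such a feature could be built with librosa: a log-Mel spectrogram, its delta and delta-delta, and a CQT spectrogram stacked along the channel axis. The parameter values (sample rate, FFT size, hop length, 128 bins, 24 CQT bins per octave) are illustrative assumptions, not settings taken from the paper.

```python
# Sketch (assumed parameters): build the 4-channel feature described above.
import numpy as np
import librosa

def four_channel_feature(path, sr=44100, n_fft=2048, hop=1024, n_bins=128):
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Channel 1: log-Mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_bins)
    log_mel = librosa.power_to_db(mel)

    # Channels 2-3: first- and second-order deltas of the log-Mel spectrogram.
    delta = librosa.feature.delta(log_mel)
    delta2 = librosa.feature.delta(log_mel, order=2)

    # Channel 4: constant-Q transform magnitude in dB. With 24 bins per octave,
    # 128 CQT bins match the 128 Mel bands in height.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop,
                             n_bins=n_bins, bins_per_octave=24))
    log_cqt = librosa.amplitude_to_db(cqt)

    # Trim to a common number of frames and stack into shape (4, n_bins, T).
    t = min(log_mel.shape[1], log_cqt.shape[1])
    return np.stack([log_mel[:, :t], delta[:, :t],
                     delta2[:, :t], log_cqt[:, :t]], axis=0)
```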
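The snapshot-ensemble step can be summarized as averaging the class probabilities produced by the models saved at different training checkpoints. The sketch below is a generic illustration; `predict_fns` is a placeholder for however the saved snapshots are loaded and evaluated, not the paper's code.

```python
# Minimal sketch of snapshot ensembling over saved training checkpoints.
import numpy as np

def snapshot_ensemble(predict_fns, x):
    """predict_fns: callables mapping a batch of 4-channel features to
    per-class probabilities of shape (batch, n_classes)."""
    probs = np.stack([f(x) for f in predict_fns], axis=0)  # (n_snapshots, batch, n_classes)
    return probs.mean(axis=0)                              # averaged class scores
```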
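One plausible reading of the two-stage 3-class/10-class decision is that each of the 10 DCASE 2020 scenes belongs to one of three coarse categories (indoor, outdoor, transportation), and the final label maximizes the combined coarse and fine scores. The scene-to-category mapping and the product fusion rule below are assumptions for illustration; the paper's exact combination rule may differ.

```python
# Illustrative two-stage score fusion (assumed mapping and fusion rule).
import numpy as np

SCENES = ["airport", "shopping_mall", "metro_station",          # indoor
          "street_pedestrian", "public_square",
          "street_traffic", "park",                              # outdoor
          "tram", "bus", "metro"]                                # transportation
COARSE = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]  # coarse-category index of each scene

def fuse_two_stage(p3, p10):
    """p3: (3,) coarse-category probabilities; p10: (10,) scene probabilities."""
    combined = np.array([p10[i] * p3[COARSE[i]] for i in range(len(SCENES))])
    return SCENES[int(np.argmax(combined))]
```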