{"title":"声学场景分类的多特征收敛网络","authors":"Menglong Wu, Hongxia Dong, Xichang Cai, Ziling Qiao, Cuizhu Qin, Lin Zhang","doi":"10.1145/3573428.3573633","DOIUrl":null,"url":null,"abstract":"This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). A series of neural network models designed with features of the Log Mel spectrogram, Deltas, and Delta-Deltas superimposed on the channel have achieved good classification results. However, the low-frequency part of the speech spectrogram feature extracted from the audio signal has a mosaic shape due to its low resolution, which leads to the loss of information in the low-frequency part of the Log Mel-Deltas-DeltaDeltas feature and reduces the classification accuracy. To solve this problem, the constant Q-transform (CQT) spectrogram is introduced and this feature is superimposed on the channel with the log Mel-Deltas-DeltaDeltas feature to form a 4-channel feature spectrum as the input to the network model. Moreover, the proposed network model is deepened by increasing the 8 residual blocks from the baseline system to 10 residual blocks and a snapshot integration operation is performed on the various models saved during the training process due to the complementary information. And then, a 3-classifier is added based on the ASC's primarily categorized scenes' 10-classifier and chooses the final scene classification by combining the 3–10 two-stage classification scores. The classification accuracy of our proposed network reached 77.4%, which is 5.1% higher than the baseline system set in this paper and 26% higher than the baseline on the official website of DCASE 2020.","PeriodicalId":314698,"journal":{"name":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Feature Convergence Network for Acoustic Scene Classification\",\"authors\":\"Menglong Wu, Hongxia Dong, Xichang Cai, Ziling Qiao, Cuizhu Qin, Lin Zhang\",\"doi\":\"10.1145/3573428.3573633\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). A series of neural network models designed with features of the Log Mel spectrogram, Deltas, and Delta-Deltas superimposed on the channel have achieved good classification results. However, the low-frequency part of the speech spectrogram feature extracted from the audio signal has a mosaic shape due to its low resolution, which leads to the loss of information in the low-frequency part of the Log Mel-Deltas-DeltaDeltas feature and reduces the classification accuracy. To solve this problem, the constant Q-transform (CQT) spectrogram is introduced and this feature is superimposed on the channel with the log Mel-Deltas-DeltaDeltas feature to form a 4-channel feature spectrum as the input to the network model. Moreover, the proposed network model is deepened by increasing the 8 residual blocks from the baseline system to 10 residual blocks and a snapshot integration operation is performed on the various models saved during the training process due to the complementary information. 
And then, a 3-classifier is added based on the ASC's primarily categorized scenes' 10-classifier and chooses the final scene classification by combining the 3–10 two-stage classification scores. The classification accuracy of our proposed network reached 77.4%, which is 5.1% higher than the baseline system set in this paper and 26% higher than the baseline on the official website of DCASE 2020.\",\"PeriodicalId\":314698,\"journal\":{\"name\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3573428.3573633\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573428.3573633","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-Feature Convergence Network for Acoustic Scene Classification
This paper investigates a multi-feature convergence network for acoustic scene classification (ASC). Neural network models that take the log-Mel spectrogram together with its delta and delta-delta features, stacked along the channel dimension, have achieved good classification results. However, the low-frequency part of the spectrogram extracted from the audio signal has a mosaic-like appearance because of its low resolution, so information in the low-frequency part of the log-Mel/deltas/delta-deltas feature is lost and classification accuracy drops. To address this, the constant-Q transform (CQT) spectrogram is introduced and stacked along the channel dimension with the log-Mel/deltas/delta-deltas feature, forming a 4-channel feature map that serves as the input to the network model. Moreover, the network is deepened from the baseline system's 8 residual blocks to 10, and snapshot ensembling is applied to the models saved at different points during training to exploit their complementary information. A 3-class classifier over the coarse scene categories is then added alongside the 10-class scene classifier, and the final scene label is chosen by combining the scores of the two-stage 3-class/10-class classification. The classification accuracy of the proposed network reaches 77.4%, which is 5.1 percentage points higher than the baseline system used in this paper and 26 percentage points higher than the official DCASE 2020 baseline.
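To make the 4-channel input concrete, below is a minimal sketch of how such a feature could be built with librosa: a log-Mel spectrogram, its delta and delta-delta, and a CQT spectrogram stacked along the channel axis. The parameter values (sample rate, FFT size, hop length, 128 bins, 24 CQT bins per octave) are illustrative assumptions, not settings taken from the paper.

```python
# Sketch (assumed parameters): build the 4-channel feature described above.
import numpy as np
import librosa

def four_channel_feature(path, sr=44100, n_fft=2048, hop=1024, n_bins=128):
    y, sr = librosa.load(path, sr=sr, mono=True)

    # Channel 1: log-Mel spectrogram.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_bins)
    log_mel = librosa.power_to_db(mel)

    # Channels 2-3: first- and second-order deltas of the log-Mel spectrogram.
    delta = librosa.feature.delta(log_mel)
    delta2 = librosa.feature.delta(log_mel, order=2)

    # Channel 4: constant-Q transform magnitude in dB. With 24 bins per octave,
    # 128 CQT bins match the 128 Mel bands in height.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop,
                             n_bins=n_bins, bins_per_octave=24))
    log_cqt = librosa.amplitude_to_db(cqt)

    # Trim to a common number of frames and stack into shape (4, n_bins, T).
    t = min(log_mel.shape[1], log_cqt.shape[1])
    return np.stack([log_mel[:, :t], delta[:, :t],
                     delta2[:, :t], log_cqt[:, :t]], axis=0)
```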
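The snapshot-ensemble step can be summarized as averaging the class probabilities produced by the models saved at different training checkpoints. The sketch below is a generic illustration; `predict_fns` is a placeholder for however the saved snapshots are loaded and evaluated, not the paper's code.

```python
# Minimal sketch of snapshot ensembling over saved training checkpoints.
import numpy as np

def snapshot_ensemble(predict_fns, x):
    """predict_fns: callables mapping a batch of 4-channel features to
    per-class probabilities of shape (batch, n_classes)."""
    probs = np.stack([f(x) for f in predict_fns], axis=0)  # (n_snapshots, batch, n_classes)
    return probs.mean(axis=0)                              # averaged class scores
```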
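One plausible reading of the two-stage 3-class/10-class decision is that each of the 10 DCASE 2020 scenes belongs to one of three coarse categories (indoor, outdoor, transportation), and the final label maximizes the combined coarse and fine scores. The scene-to-category mapping and the product fusion rule below are assumptions for illustration; the paper's exact combination rule may differ.

```python
# Illustrative two-stage score fusion (assumed mapping and fusion rule).
import numpy as np

SCENES = ["airport", "shopping_mall", "metro_station",          # indoor
          "street_pedestrian", "public_square",
          "street_traffic", "park",                              # outdoor
          "tram", "bus", "metro"]                                # transportation
COARSE = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]  # coarse-category index of each scene

def fuse_two_stage(p3, p10):
    """p3: (3,) coarse-category probabilities; p10: (10,) scene probabilities."""
    combined = np.array([p10[i] * p3[COARSE[i]] for i in range(len(SCENES))])
    return SCENES[int(np.argmax(combined))]
```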