Classification of Short Audio Acoustic Scenes Based on Data Augmentation Methods

Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.23919/APSIPAASC55919.2022.9980120
Published: 2022-11-07
Citations: 0
Abstract
How to effectively classify short audio clips into acoustic scenes is a new challenge posed by Task 1 of the DCASE2022 challenge. This paper details our exploration of this problem and the architecture we used. Our architecture is based on SegNet, with an instance normalization layer added to normalize the activations of the previous layer at conv_block 1 of the encoder and deconv_block 2 of the decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. Six data augmentation methods were applied: mixup, time- and frequency-domain masking, image augmentation, auto level, pix2pix, and random crop. To reduce model complexity, we applied three model compression schemes: pruning, quantization, and knowledge distillation. The proposed system achieved higher classification accuracy than the baseline system, reaching an average accuracy of 60.58% on the test split of the TAU Urban Acoustic Scenes 2022 Mobile development dataset. After model compression, the model achieved an average accuracy of 54.11% within a budget of 127.2 K parameters, 8-bit quantization, and fewer than 30 MMACs.
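The abstract describes stacking log-mel spectrograms with their delta and delta-delta features as model input. The authors' implementation is not shown; the following is a minimal numpy sketch of one common way to compute regression-based deltas and stack the three channels (shapes and the `width` parameter are assumptions, not from the paper):

```python
import numpy as np

def deltas(feat, width=2):
    """Regression-based delta features along the time axis (axis 1).

    feat: array of shape (n_mels, n_frames), e.g. a log-mel spectrogram.
    Uses edge padding so the output has the same shape as the input.
    """
    T = feat.shape[1]
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    # Weighted central differences: sum_k k * (f[t+k] - f[t-k])
    num = sum(
        k * (padded[:, width + k:width + k + T] - padded[:, width - k:width - k + T])
        for k in range(1, width + 1)
    )
    denom = 2 * sum(k * k for k in range(1, width + 1))
    return num / denom

def stack_features(log_mel):
    """Stack log-mel, delta, and delta-delta into a 3-channel input tensor."""
    d1 = deltas(log_mel)
    d2 = deltas(d1)
    return np.stack([log_mel, d1, d2], axis=0)  # (3, n_mels, n_frames)
```

On a signal that is linear in time, the interior delta values recover the per-frame slope, which is a quick sanity check for the implementation.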
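Among the six augmentation methods, mixup forms a convex combination of two training examples and their labels with a mixing coefficient drawn from a Beta distribution. A generic sketch (the `alpha` value and one-hot label convention are assumptions; the paper does not give its hyperparameters here):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup augmentation: blend two inputs and their one-hot labels.

    lambda is drawn from Beta(alpha, alpha); the same lambda weights
    both the features and the labels.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

Because the label weights are lam and 1 - lam, the mixed label vector still sums to 1 when the inputs are one-hot.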
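Time- and frequency-domain masking, another of the listed augmentations, zeroes out random bands of the spectrogram (in the style of SpecAugment). A minimal sketch, assuming a (n_mels, n_frames) array and hypothetical maximum mask widths not taken from the paper:

```python
import numpy as np

def mask_spec(spec, max_freq=8, max_time=20, rng=None):
    """Zero one random frequency band and one random time band.

    spec: array of shape (n_mels, n_frames). Returns a masked copy;
    the input array is left unmodified.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    # Frequency mask: f consecutive mel bins starting at f0
    f = rng.integers(0, max_freq + 1)
    f0 = rng.integers(0, out.shape[0] - f + 1)
    out[f0:f0 + f, :] = 0.0
    # Time mask: t consecutive frames starting at t0
    t = rng.integers(0, max_time + 1)
    t0 = rng.integers(0, out.shape[1] - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out
```

Masking is applied on the fly per training example, so each epoch sees differently occluded versions of the same clip.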