{"title":"语音情感识别中调制谱特征的噪声-混响-鲁棒性研究","authors":"Taiyang Guo, Sixia Li, M. Unoki, S. Okada","doi":"10.23919/APSIPAASC55919.2022.9980032","DOIUrl":null,"url":null,"abstract":"Speech-emotion recognition (SER) in noisy reverber-ant environments is a fundamental technique for real-world ap-plications, including call center service and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find robust features against noise and reverberation for SER. However, it has been proved that a daily noisy reverberant environment (signal-to-noise ratio is greater than 10 dB and reverberation time is less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the auditory system of human perception, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains emotional components of speech signals in noisy reverberant environments as speech data. We also used a support vector machine as the classifier to carry out emotion recognition. The experimental results indicate that compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition\",\"authors\":\"Taiyang Guo, Sixia Li, M. Unoki, S. Okada\",\"doi\":\"10.23919/APSIPAASC55919.2022.9980032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech-emotion recognition (SER) in noisy reverber-ant environments is a fundamental technique for real-world ap-plications, including call center service and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find robust features against noise and reverberation for SER. However, it has been proved that a daily noisy reverberant environment (signal-to-noise ratio is greater than 10 dB and reverberation time is less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the auditory system of human perception, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. 
We used noise-vocoded speech, which is synthesized speech that retains emotional components of speech signals in noisy reverberant environments as speech data. We also used a support vector machine as the classifier to carry out emotion recognition. The experimental results indicate that compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.\",\"PeriodicalId\":382967,\"journal\":{\"name\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/APSIPAASC55919.2022.9980032\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/APSIPAASC55919.2022.9980032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition
Speech-emotion recognition (SER) in noisy reverberant environments is a fundamental technique for real-world applications, including call-center services and the diagnosis of psychological diseases. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (no noise and no reverberation). To remedy this shortcoming, it is necessary to find features for SER that are robust against noise and reverberation. In contrast, it has been shown that a daily noisy reverberant environment (signal-to-noise ratio greater than 10 dB and reverberation time less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the human auditory perception system, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs therefore has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. As speech data, we used noise-vocoded speech, i.e., synthesized speech that retains the emotional components of speech signals in noisy reverberant environments. We used a support vector machine as the classifier for emotion recognition. The experimental results indicate that, compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments, with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.
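The abstract does not give implementation details, so the following is only a minimal Python sketch of the kind of pipeline it describes: extracting modulation spectral features (band-limited temporal envelopes analyzed by a modulation filterbank) and feeding them to a support vector machine. The filterbank layout (8 acoustic bands, 6 modulation bands), the 16-kHz sampling rate, and the helper names (log_spaced_bands, extract_msf, train_ser_classifier) are illustrative assumptions, not the paper's exact MSF definition or experimental settings.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FS = 16000  # assumed sampling rate of the speech signals

def log_spaced_bands(f_lo, f_hi, n_bands):
    """Log-spaced band edges between f_lo and f_hi (Hz)."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    return list(zip(edges[:-1], edges[1:]))

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase Butterworth band-pass filter."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def extract_msf(speech, fs=FS,
                acoustic_bands=log_spaced_bands(100, 7000, 8),
                modulation_bands=log_spaced_bands(0.5, 32, 6)):
    """One MSF vector: log modulation energy per (acoustic band, modulation band) pair."""
    feats = []
    for a_lo, a_hi in acoustic_bands:
        band = bandpass(speech, a_lo, a_hi, fs)
        envelope = np.abs(hilbert(band))        # temporal envelope of the acoustic band
        # Low-pass and decimate the envelope before modulation analysis.
        env_fs = 200
        sos_lp = butter(4, 64, btype="lowpass", fs=fs, output="sos")
        envelope = sosfiltfilt(sos_lp, envelope)[::fs // env_fs]
        for m_lo, m_hi in modulation_bands:
            mod = bandpass(envelope, m_lo, m_hi, env_fs)
            feats.append(np.sqrt(np.mean(mod ** 2)))   # RMS modulation energy
    return np.log(np.asarray(feats) + 1e-12)

def train_ser_classifier(utterances, labels):
    """Fit an SVM on MSF vectors; `utterances` are 1-D speech arrays (e.g., noise-vocoded
    speech rendered in noisy reverberant conditions) and `labels` their emotion categories."""
    X = np.stack([extract_msf(u) for u in utterances])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return clf.fit(X, labels)

In this sketch the per-utterance feature vector has 8 x 6 = 48 modulation energies; standardizing the features before the RBF-kernel SVM mirrors common SER practice, while the actual kernel and hyperparameters used in the paper are not stated in the abstract.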