ScSer: Supervised Contrastive Learning for Speech Emotion Recognition using Transformers
Varun Sai Alaparthi, Tejeswara Reddy Pasam, Deepak Abhiram Inagandla, J. Prakash, P. Singh
2022 15th International Conference on Human System Interaction (HSI), 2022-07-28. DOI: 10.1109/HSI55341.2022.9869453
Emotion recognition from speech is a challenging task and an active area of research in effective Human-Computer Interaction (HCI). Though many deep learning and machine learning approaches have been proposed to tackle the problem, they fall short both in accuracy and in learning robust representations that are agnostic to changes in voice. Additionally, there is a lack of sufficient labelled speech data for training bigger models. To overcome these issues, we propose supervised contrastive learning with transformers for the task of speech emotion recognition (ScSer) and evaluate it on different standard datasets. Further, we experiment with the supervised contrastive setting using different augmentations from the WavAugment library as well as some custom augmentations. Finally, we propose a custom augmentation, random cyclic shift, with which ScSer outperforms other competitive methods, producing a state-of-the-art accuracy of 96% on the RAVDESS dataset with 7600 samples (Big-Ravdess) and a 2-4% boost over other wav2vec methods.
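The supervised contrastive objective referred to here is typically the SupCon loss of Khosla et al. (2020), which pulls together embeddings of utterances that share an emotion label and pushes apart all others. Below is a minimal PyTorch sketch of that loss; the function name, the temperature value of 0.07, and the (batch, dim) embedding layout are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # features: (batch, dim) utterance embeddings from the encoder.
    # labels:   (batch,) integer emotion labels.
    features = F.normalize(features, dim=1)             # work in cosine-similarity space
    sim = features @ features.T / temperature           # pairwise similarity logits
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other samples with the same emotion label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)  # keep positive terms only
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                               # anchors with at least one positive
    loss = -pos_log_prob.sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

In the paper's setup the embeddings come from a wav2vec-style transformer encoder, but any encoder producing a fixed-size embedding per utterance could be plugged into this loss.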
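The abstract does not spell out the random cyclic shift augmentation; a plausible reading is a circular rotation of the raw waveform by a random offset, so that samples pushed past one end wrap around to the other and the emotional content of the utterance is preserved. The sketch below assumes that interpretation, and random_cyclic_shift is a hypothetical name.

import torch

def random_cyclic_shift(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (..., num_samples) raw audio tensor.
    num_samples = waveform.shape[-1]
    shift = int(torch.randint(0, num_samples, (1,)))
    # Rotate the signal along the time axis: samples shifted past the end
    # wrap to the front, so no audio is lost and the label stays valid.
    return torch.roll(waveform, shifts=shift, dims=-1)

# Example: augmented = random_cyclic_shift(batch_of_waveforms)

Because it is label-preserving and cheap, such a shift fits naturally alongside the WavAugment transforms mentioned above as a source of extra positive views for the contrastive objective.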