ScSer: Supervised Contrastive Learning for Speech Emotion Recognition using Transformers
Varun Sai Alaparthi, Tejeswara Reddy Pasam, Deepak Abhiram Inagandla, J. Prakash, P. Singh
2022 15th International Conference on Human System Interaction (HSI), published 2022-07-28. DOI: 10.1109/HSI55341.2022.9869453
Citations: 7
Abstract
Emotion recognition from speech is a key and challenging task and an active area of research in effective Human-Computer Interaction (HCI). Though many deep learning and machine learning approaches have been proposed to tackle the problem, they fall short both in accuracy and in learning robust representations that are agnostic to changes in voice. Additionally, there is a lack of sufficient labelled speech data for larger models. To overcome these issues, we propose supervised contrastive learning with transformers for the task of speech emotion recognition (ScSer) and evaluate it on several standard datasets. Further, we experiment with the supervised contrastive setting using different augmentations from the WavAugment library as well as some custom augmentations. Finally, we propose a custom augmentation, random cyclic shift, with which ScSer outperforms other competitive methods, producing a state-of-the-art accuracy of 96% on the RAVDESS dataset with 7600 samples (Big-Ravdess) and a 2-4% boost over other wav2vec-based methods.
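The abstract names the random cyclic shift augmentation but does not give its exact formulation. A minimal PyTorch sketch follows, assuming the shift wraps the waveform around its end and that the maximum offset (`max_fraction`) is a free parameter; both the function name and the default value are illustrative assumptions, not the paper's definition.

```python
import torch

def random_cyclic_shift(waveform: torch.Tensor, max_fraction: float = 0.5) -> torch.Tensor:
    """Cyclically shift a waveform along its last dimension by a random offset.

    `max_fraction` (largest allowed shift as a fraction of the signal length)
    is an assumed parameter for illustration; the paper does not specify it.
    """
    num_samples = waveform.shape[-1]
    max_shift = int(num_samples * max_fraction)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)).item())
    # torch.roll wraps samples around the end, so no audio content is lost.
    return torch.roll(waveform, shifts=shift, dims=-1)
```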
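Likewise, the supervised contrastive objective is not spelled out in the abstract; the method's name suggests the standard SupCon loss of Khosla et al. (2020), sketched below for a batch of pooled transformer embeddings (e.g. wav2vec outputs). The function name and temperature default are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over a batch of embeddings.

    features: (N, D) embeddings, e.g. pooled transformer outputs.
    labels:   (N,) integer emotion labels.
    """
    features = F.normalize(features, dim=1)
    logits = features @ features.T / temperature  # pairwise cosine similarities
    n = features.shape[0]
    # Exclude self-similarity on the diagonal from the softmax denominator.
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, float('-inf'))
    # Positives: other samples in the batch that share the same emotion label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives, then over anchors;
    # anchors with no positives contribute zero (clamp avoids divide-by-zero).
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0).sum(dim=1) / pos_counts).mean()
    return loss
```

Pulling same-label samples together in embedding space is what the abstract credits for representations that stay robust to changes in voice.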