Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

{"title":"使用视觉变形器在人机交互中进行个性化语音情感识别","authors":"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa","doi":"arxiv-2409.10687","DOIUrl":null,"url":null,"abstract":"Emotions are an essential element in verbal communication, so understanding\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\nThis paper investigates the application of vision transformer models, namely\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\ngeneralize the SER models for individual speech characteristics by fine-tuning\nthese models on benchmark datasets and exploiting ensemble methods. For this\npurpose, we collected audio data from different human subjects having\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\nViT and BEiT-based models and tested these models on unseen speech samples from\nthe participants. In the results, we show that fine-tuning vision transformers\non benchmark datasets and and then using either these already fine-tuned models\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\nindividual when it comes to identifying four primary emotions from their\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\nor BEiTs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers\",\"authors\":\"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa\",\"doi\":\"arxiv-2409.10687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotions are an essential element in verbal communication, so understanding\\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\\nThis paper investigates the application of vision transformer models, namely\\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\\ngeneralize the SER models for individual speech characteristics by fine-tuning\\nthese models on benchmark datasets and exploiting ensemble methods. For this\\npurpose, we collected audio data from different human subjects having\\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\\nViT and BEiT-based models and tested these models on unseen speech samples from\\nthe participants. 
In the results, we show that fine-tuning vision transformers\\non benchmark datasets and and then using either these already fine-tuned models\\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\\nindividual when it comes to identifying four primary emotions from their\\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\\nor BEiTs.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
本文研究了视觉变换器模型,即ViT(视觉变换器)和BEiT(图像变换器的BERT预训练)管道在人机交互中的语音情感识别(SER)应用。重点是通过在基准数据集上对这些模型进行微调,并利用集合方法,针对单个语音特征对 SER 模型进行泛化。为此,我们收集了不同人类受试者与NAO机器人进行伪自然对话的音频数据。然后,我们对基于ViT和BEiT的模型进行了微调,并在受试者未见过的语音样本上对这些模型进行了测试。结果表明,与微调ViTs或BEiTs相比,微调基准数据集的视觉变换,然后使用这些已微调的模型或ViT/BEiT模型的集合,在从语音中识别四种主要情绪(中性、快乐、悲伤和愤怒)时,每个人的分类准确率最高。
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa

Emotions are an essential element of verbal communication, so understanding individuals'
affect during human-robot interaction (HRI) is imperative. This paper investigates the
application of vision transformer models, namely ViT (Vision Transformer) and BEiT (BERT
Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI.
The focus is on generalizing SER models to individual speech characteristics by fine-tuning
these models on benchmark datasets and exploiting ensemble methods. For this purpose, we
collected audio data from human subjects having pseudo-naturalistic conversations with the
NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech
samples from the participants. The results show that fine-tuning vision transformers on
benchmark datasets and then using either these already fine-tuned models or ViT/BEiT ensembles
yields the highest per-individual classification accuracies for identifying four primary
emotions from speech (neutral, happy, sad, and angry), compared with fine-tuning vanilla ViTs
or BEiTs.
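
The abstract does not spell out the implementation, but one plausible reading of the pipeline
(speech rendered as a spectrogram image, ImageNet-pretrained ViT and BEiT backbones given a
four-class emotion head, and a simple ViT/BEiT ensemble at inference) can be sketched as
follows. The mel-spectrogram preprocessing, the specific Hugging Face checkpoints, and the
softmax-averaging ensemble are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch (not the authors' released code): audio clip -> mel-spectrogram image
# -> ViT/BEiT classifiers -> one of four emotion labels. Spectrogram settings, model
# checkpoints, and the averaging ensemble are assumptions for illustration only.

import numpy as np
import librosa
import torch
from PIL import Image
from transformers import (AutoImageProcessor,
                          ViTForImageClassification,
                          BeitForImageClassification)

LABELS = ["neutral", "happy", "sad", "angry"]  # the four primary emotions in the paper


def wav_to_spectrogram_image(path: str, sr: int = 16000) -> Image.Image:
    """Convert an audio clip to a 3-channel mel-spectrogram image (assumed preprocessing)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale to 0-255 and replicate to 3 channels so an image backbone can consume it.
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = (255 * scaled).astype(np.uint8)
    return Image.fromarray(np.stack([img] * 3, axis=-1))


# ImageNet-pretrained backbones with a freshly initialized 4-way classification head.
# These heads would first be fine-tuned on spectrogram/label pairs from benchmark SER
# datasets (as the abstract describes) before ensemble_predict gives meaningful output.
vit_name, beit_name = "google/vit-base-patch16-224", "microsoft/beit-base-patch16-224"
vit_proc = AutoImageProcessor.from_pretrained(vit_name)
beit_proc = AutoImageProcessor.from_pretrained(beit_name)
vit = ViTForImageClassification.from_pretrained(
    vit_name, num_labels=len(LABELS), ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    beit_name, num_labels=len(LABELS), ignore_mismatched_sizes=True)


@torch.no_grad()
def ensemble_predict(path: str) -> str:
    """Average ViT and BEiT softmax outputs (one simple ensembling choice)."""
    image = wav_to_spectrogram_image(path)
    p_vit = torch.softmax(vit(**vit_proc(images=image, return_tensors="pt")).logits, dim=-1)
    p_beit = torch.softmax(beit(**beit_proc(images=image, return_tensors="pt")).logits, dim=-1)
    return LABELS[int((p_vit + p_beit).argmax(dim=-1))]
```

In a per-individual evaluation like the one described, the fine-tuned single models and the
ensemble would each be scored on a participant's held-out speech samples, and the best of the
two reported for that participant.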