Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

{"title":"使用视觉变形器在人机交互中进行个性化语音情感识别","authors":"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa","doi":"arxiv-2409.10687","DOIUrl":null,"url":null,"abstract":"Emotions are an essential element in verbal communication, so understanding\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\nThis paper investigates the application of vision transformer models, namely\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\ngeneralize the SER models for individual speech characteristics by fine-tuning\nthese models on benchmark datasets and exploiting ensemble methods. For this\npurpose, we collected audio data from different human subjects having\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\nViT and BEiT-based models and tested these models on unseen speech samples from\nthe participants. In the results, we show that fine-tuning vision transformers\non benchmark datasets and and then using either these already fine-tuned models\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\nindividual when it comes to identifying four primary emotions from their\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\nor BEiTs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers\",\"authors\":\"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa\",\"doi\":\"arxiv-2409.10687\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotions are an essential element in verbal communication, so understanding\\nindividuals' affect during a human-robot interaction (HRI) becomes imperative.\\nThis paper investigates the application of vision transformer models, namely\\nViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\\npipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\\ngeneralize the SER models for individual speech characteristics by fine-tuning\\nthese models on benchmark datasets and exploiting ensemble methods. For this\\npurpose, we collected audio data from different human subjects having\\npseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\\nViT and BEiT-based models and tested these models on unseen speech samples from\\nthe participants. 
In the results, we show that fine-tuning vision transformers\\non benchmark datasets and and then using either these already fine-tuned models\\nor ensembling ViT/BEiT models gets us the highest classification accuracies per\\nindividual when it comes to identifying four primary emotions from their\\nspeech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\\nor BEiTs.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10687\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
本文研究了视觉变换器模型,即ViT(视觉变换器)和BEiT(图像变换器的BERT预训练)管道在人机交互中的语音情感识别(SER)应用。重点是通过在基准数据集上对这些模型进行微调,并利用集合方法,针对单个语音特征对 SER 模型进行泛化。为此,我们收集了不同人类受试者与NAO机器人进行伪自然对话的音频数据。然后,我们对基于ViT和BEiT的模型进行了微调,并在受试者未见过的语音样本上对这些模型进行了测试。结果表明,与微调ViTs或BEiTs相比,微调基准数据集的视觉变换,然后使用这些已微调的模型或ViT/BEiT模型的集合,在从语音中识别四种主要情绪(中性、快乐、悲伤和愤怒)时,每个人的分类准确率最高。
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa

Emotions are an essential element of verbal communication, so understanding individuals'
affect during human-robot interaction (HRI) is imperative. This paper investigates the
application of vision transformer models, namely ViT (Vision Transformer) and BEiT (BERT
Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI.
The focus is on generalizing SER models to individual speech characteristics by fine-tuning
these models on benchmark datasets and exploiting ensemble methods. For this purpose, we
collected audio data from human subjects having pseudo-naturalistic conversations with the
NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech
samples from the participants. The results show that fine-tuning vision transformers on
benchmark datasets and then using either these already fine-tuned models or ViT/BEiT ensembles
yields the highest per-individual classification accuracies for identifying four primary
emotions from speech (neutral, happy, sad, and angry), compared with fine-tuning vanilla ViTs
or BEiTs.
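
The abstract does not spell out the implementation, but one plausible reading of the pipeline
(speech rendered as a spectrogram image, ImageNet-pretrained ViT and BEiT backbones given a
four-class emotion head, and a simple ViT/BEiT ensemble at inference) can be sketched as
follows. The mel-spectrogram preprocessing, the specific Hugging Face checkpoints, and the
softmax-averaging ensemble are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch (not the authors' released code): audio clip -> mel-spectrogram image
# -> ViT/BEiT classifiers -> one of four emotion labels. Spectrogram settings, model
# checkpoints, and the averaging ensemble are assumptions for illustration only.

import numpy as np
import librosa
import torch
from PIL import Image
from transformers import (AutoImageProcessor,
                          ViTForImageClassification,
                          BeitForImageClassification)

LABELS = ["neutral", "happy", "sad", "angry"]  # the four primary emotions in the paper


def wav_to_spectrogram_image(path: str, sr: int = 16000) -> Image.Image:
    """Convert an audio clip to a 3-channel mel-spectrogram image (assumed preprocessing)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Scale to 0-255 and replicate to 3 channels so an image backbone can consume it.
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = (255 * scaled).astype(np.uint8)
    return Image.fromarray(np.stack([img] * 3, axis=-1))


# ImageNet-pretrained backbones with a freshly initialized 4-way classification head.
# These heads would first be fine-tuned on spectrogram/label pairs from benchmark SER
# datasets (as the abstract describes) before ensemble_predict gives meaningful output.
vit_name, beit_name = "google/vit-base-patch16-224", "microsoft/beit-base-patch16-224"
vit_proc = AutoImageProcessor.from_pretrained(vit_name)
beit_proc = AutoImageProcessor.from_pretrained(beit_name)
vit = ViTForImageClassification.from_pretrained(
    vit_name, num_labels=len(LABELS), ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    beit_name, num_labels=len(LABELS), ignore_mismatched_sizes=True)


@torch.no_grad()
def ensemble_predict(path: str) -> str:
    """Average ViT and BEiT softmax outputs (one simple ensembling choice)."""
    image = wav_to_spectrogram_image(path)
    p_vit = torch.softmax(vit(**vit_proc(images=image, return_tensors="pt")).logits, dim=-1)
    p_beit = torch.softmax(beit(**beit_proc(images=image, return_tensors="pt")).logits, dim=-1)
    return LABELS[int((p_vit + p_beit).argmax(dim=-1))]
```

In a per-individual evaluation like the one described, the fine-tuned single models and the
ensemble would each be scored on a participant's held-out speech samples, and the best of the
two reported for that participant.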