{"title":"Dataset Construction and Effectiveness Evaluation of Spoken-Emotion Recognition for Human Machine Interaction","authors":"Mitsuki Okayama;Tatsuhito Hasegawa","doi":"10.1109/ACCESS.2025.3565537","DOIUrl":null,"url":null,"abstract":"The widespread use of large language models (LLMs) and voice-based agents has rapidly expanded Human-Computer Interaction (HCI) through spoken dialogue. To achieve more natural communication, nonverbal cues—especially those tied to emotional states—are critical and have been studied via deep learning. However, three key challenges persist in existing emotion recognition datasets: 1) most assume human-to-human interaction, neglecting shifts in speech patterns when users address a machine, 2) many include acted emotional expressions that differ from genuine internal states, and 3) even non-acted datasets often rely on third-party labels, creating potential mismatches with speakers’ actual emotions. Prior studies report that agreement between external labels and speakers’ internal states can be as low as 60–70%. To address these gaps, we present the VR-Self-Annotation Emotion Dataset (VSAED), consisting of 1,352 naturally induced and non-acted Japanese utterances (1.5 hours). Each utterance is labeled with self-reported internal emotional states spanning six categories. We investigated: 1) how effectively non-acted, machine-oriented speech conveys internal emotions, 2) whether speakers alter expressions when aware of an emotion recognition system, and 3) whether specific conditions yield notably high accuracy. In experiments using a HuBERT-based classifier, we achieve around 40% recognition accuracy, underscoring the complexity of capturing subtle internal emotions. These findings highlight the importance of domain-specific datasets for human-machine interactions.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"13 ","pages":"79084-79097"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10979942","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10979942/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The widespread use of large language models (LLMs) and voice-based agents has rapidly expanded Human-Computer Interaction (HCI) through spoken dialogue. To achieve more natural communication, nonverbal cues—especially those tied to emotional states—are critical and have been studied via deep learning. However, three key challenges persist in existing emotion recognition datasets: 1) most assume human-to-human interaction, neglecting shifts in speech patterns when users address a machine, 2) many include acted emotional expressions that differ from genuine internal states, and 3) even non-acted datasets often rely on third-party labels, creating potential mismatches with speakers’ actual emotions. Prior studies report that agreement between external labels and speakers’ internal states can be as low as 60–70%. To address these gaps, we present the VR-Self-Annotation Emotion Dataset (VSAED), consisting of 1,352 naturally induced and non-acted Japanese utterances (1.5 hours). Each utterance is labeled with self-reported internal emotional states spanning six categories. We investigated: 1) how effectively non-acted, machine-oriented speech conveys internal emotions, 2) whether speakers alter expressions when aware of an emotion recognition system, and 3) whether specific conditions yield notably high accuracy. In experiments using a HuBERT-based classifier, we achieve around 40% recognition accuracy, underscoring the complexity of capturing subtle internal emotions. These findings highlight the importance of domain-specific datasets for human-machine interactions.
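The abstract reports that a HuBERT-based classifier reaches around 40% accuracy over the six self-reported emotion categories. As a rough illustration only, the sketch below shows what such a classifier can look like using the Hugging Face transformers library; the checkpoint name, mean pooling, and linear classification head are assumptions for the sketch, not the authors' reported setup.

```python
# A minimal sketch (not the paper's implementation) of a HuBERT-based
# six-class speech-emotion classifier. The checkpoint, pooling strategy,
# and classification head below are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel, Wav2Vec2FeatureExtractor

NUM_EMOTIONS = 6  # the six self-reported emotion categories in VSAED


class HubertEmotionClassifier(nn.Module):
    def __init__(self, checkpoint: str = "facebook/hubert-base-ls960"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_EMOTIONS)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # (batch, samples) raw waveform -> (batch, frames, hidden_size)
        hidden = self.encoder(input_values).last_hidden_state
        # Mean-pool over time, then map to six emotion logits.
        return self.head(hidden.mean(dim=1))


# Usage with a dummy 1-second waveform sampled at 16 kHz.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertEmotionClassifier()
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_values)  # shape: (1, 6)
print(logits.argmax(dim=-1))
```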
IEEE Access — COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Annual publications: 6673
Review time: 6 weeks
Journal Introduction:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.