{"title":"Dataset Construction and Effectiveness Evaluation of Spoken-Emotion Recognition for Human Machine Interaction","authors":"Mitsuki Okayama;Tatsuhito Hasegawa","doi":"10.1109/ACCESS.2025.3565537","DOIUrl":null,"url":null,"abstract":"The widespread use of large language models (LLMs) and voice-based agents has rapidly expanded Human-Computer Interaction (HCI) through spoken dialogue. To achieve more natural communication, nonverbal cues—especially those tied to emotional states—are critical and have been studied via deep learning. However, three key challenges persist in existing emotion recognition datasets: 1) most assume human-to-human interaction, neglecting shifts in speech patterns when users address a machine, 2) many include acted emotional expressions that differ from genuine internal states, and 3) even non-acted datasets often rely on third-party labels, creating potential mismatches with speakers’ actual emotions. Prior studies report that agreement between external labels and speakers’ internal states can be as low as 60–70%. To address these gaps, we present the VR-Self-Annotation Emotion Dataset (VSAED), consisting of 1,352 naturally induced and non-acted Japanese utterances (1.5 hours). Each utterance is labeled with self-reported internal emotional states spanning six categories. We investigated: 1) how effectively non-acted, machine-oriented speech conveys internal emotions, 2) whether speakers alter expressions when aware of an emotion recognition system, and 3) whether specific conditions yield notably high accuracy. In experiments using a HuBERT-based classifier, we achieve around 40% recognition accuracy, underscoring the complexity of capturing subtle internal emotions. These findings highlight the importance of domain-specific datasets for human-machine interactions.","PeriodicalId":13079,"journal":{"name":"IEEE Access","volume":"13 ","pages":"79084-79097"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10979942","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Access","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10979942/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The widespread use of large language models (LLMs) and voice-based agents has rapidly expanded Human-Computer Interaction (HCI) through spoken dialogue. To achieve more natural communication, nonverbal cues—especially those tied to emotional states—are critical and have been studied via deep learning. However, three key challenges persist in existing emotion recognition datasets: 1) most assume human-to-human interaction, neglecting shifts in speech patterns when users address a machine, 2) many include acted emotional expressions that differ from genuine internal states, and 3) even non-acted datasets often rely on third-party labels, creating potential mismatches with speakers’ actual emotions. Prior studies report that agreement between external labels and speakers’ internal states can be as low as 60–70%. To address these gaps, we present the VR-Self-Annotation Emotion Dataset (VSAED), consisting of 1,352 naturally induced and non-acted Japanese utterances (1.5 hours). Each utterance is labeled with self-reported internal emotional states spanning six categories. We investigated: 1) how effectively non-acted, machine-oriented speech conveys internal emotions, 2) whether speakers alter expressions when aware of an emotion recognition system, and 3) whether specific conditions yield notably high accuracy. In experiments using a HuBERT-based classifier, we achieve around 40% recognition accuracy, underscoring the complexity of capturing subtle internal emotions. These findings highlight the importance of domain-specific datasets for human-machine interactions.
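The abstract reports that a HuBERT-based classifier reaches around 40% accuracy over the six self-reported emotion categories. As a rough illustration only, the sketch below shows what such a classifier can look like using the Hugging Face transformers library; the checkpoint name, mean pooling, and linear classification head are assumptions for the sketch, not the authors' reported setup.

```python
# A minimal sketch (not the paper's implementation) of a HuBERT-based
# six-class speech-emotion classifier. The checkpoint, pooling strategy,
# and classification head below are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel, Wav2Vec2FeatureExtractor

NUM_EMOTIONS = 6  # the six self-reported emotion categories in VSAED


class HubertEmotionClassifier(nn.Module):
    def __init__(self, checkpoint: str = "facebook/hubert-base-ls960"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_EMOTIONS)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # (batch, samples) raw waveform -> (batch, frames, hidden_size)
        hidden = self.encoder(input_values).last_hidden_state
        # Mean-pool over time, then map to six emotion logits.
        return self.head(hidden.mean(dim=1))


# Usage with a dummy 1-second waveform sampled at 16 kHz.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertEmotionClassifier()
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_values)  # shape: (1, 6)
print(logits.argmax(dim=-1))
```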
IEEE Access — COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Annual publications: 6673
Review time: 6 weeks
Journal Introduction:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.