Analyzing audiovisual data for understanding user's emotion in human–computer interaction environment

Impact Factor 1.7 · CAS Zone 4 (Computer Science) · JCR Q3 (Computer Science, Information Systems)
Juan Yang, Zhenkun Li, Xu Du
{"title":"Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment","authors":"Juan Yang, Zhenkun Li, Xu Du","doi":"10.1108/dta-08-2023-0414","DOIUrl":null,"url":null,"abstract":"Purpose Although numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common and predominant forms for human beings to express their emotional states in daily communication. Therefore, how to achieve automatic and accurate audiovisual emotion recognition is significantly important for developing engaging and empathetic human–computer interaction environment. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from these two modalities to generate discriminative representations. Design/methodology/approach A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN attempts to integrate key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, filling the research gaps of existing approaches. Specifically, the local maximum–based content analysis is designed to extract key-frames from videos for the purpose of eliminating data redundancy. Two modules, including “Multi-head Attention-based Intra-modality Interaction Module” and “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to mine and capture intra- and cross-modality interactions for further reducing data redundancy and producing more powerful multimodal representations. Findings Extensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities can provide more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy can enhance the performance by more than 2.79 per cent on accuracy. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion can lead to better prediction performance. Originality/value The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environment.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":"312 1","pages":"0"},"PeriodicalIF":1.7000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Technologies and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/dta-08-2023-0414","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Purpose
Although numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common and predominant forms through which human beings express their emotional states in daily communication. Therefore, achieving automatic and accurate audiovisual emotion recognition is important for developing engaging and empathetic human–computer interaction environments. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture the representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from the two modalities to generate discriminative representations.

Design/methodology/approach
A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN integrates key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, addressing gaps in existing approaches. Specifically, a local maximum-based content analysis is designed to extract key-frames from videos in order to eliminate data redundancy. Two modules, a "Multi-head Attention-based Intra-modality Interaction Module" and a "Multi-head Attention-based Cross-modality Interaction Module", are proposed to mine and capture intra- and cross-modality interactions, further reducing data redundancy and producing more powerful multimodal representations.

Findings
Extensive experiments on two benchmark datasets (RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically: (1) KE-AFN outperforms state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities provides more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy improves accuracy by more than 2.79 per cent. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion lead to better prediction performance.

Originality/value
The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environments.
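The abstract describes the two core ideas only at a high level. As a rough illustration of the first idea, the sketch below selects key-frames at local maxima of a simple frame-to-frame content-change score; the scoring function, the peak test and the frame budget are assumptions for illustration and do not reproduce the paper's exact local maximum-based content analysis.

```python
# Illustrative key-frame selection via local maxima of a content-change score.
# NOTE: the score (mean absolute difference of consecutive grayscale frames),
# the peak test and the frame budget are assumptions, not the authors' method.
import numpy as np

def select_key_frames(frames: np.ndarray, max_frames: int = 16) -> np.ndarray:
    """frames: (T, H, W) grayscale video frames; returns indices of selected key-frames."""
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    score = diffs.mean(axis=(1, 2))                  # change score per transition, shape (T-1,)

    # A transition is a peak if its score exceeds both of its neighbours.
    is_peak = (score[1:-1] > score[:-2]) & (score[1:-1] > score[2:])
    peak_idx = np.where(is_peak)[0] + 1              # undo the slicing offset

    # Keep the strongest peaks within the budget, restored to temporal order.
    strongest = peak_idx[np.argsort(score[peak_idx])[::-1][:max_frames]]
    return np.sort(strongest) + 1                    # map transition index to frame index
```

For the second idea, the following minimal sketch shows how multi-head attention can let each modality query the other and how the two cross-attended streams can be fused for classification, in the spirit of the cross-modality interaction module and attention-based fusion mentioned above. The feature dimensions, the use of torch.nn.MultiheadAttention, mean pooling and the eight-class head are assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-head cross-modality attention and fusion (assumed design,
# not the published KE-AFN code).
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries visual
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries audio
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_a, dim), visual: (B, T_v, dim) -- pre-extracted embeddings.
        audio_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        visual_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        # Temporal mean pooling, then concatenate the two cross-attended streams.
        fused = torch.cat([audio_ctx.mean(dim=1), visual_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Example: a batch of 2 clips with 50 audio frames and 16 selected visual key-frames.
logits = CrossModalAttentionFusion()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))
```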
Source Journal
Data Technologies and Applications
Subject category: Social Sciences – Library and Information Sciences
CiteScore: 3.80
Self-citation rate: 6.20%
Articles per year: 29
Journal description: Previously published as Program. Online from 2018. Subject areas: Information & Knowledge Management; Library Studies.