EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models

Impact Factor: 2.4 | JCR Quartile: Q3 | Category: Computer Science, Interdisciplinary Applications
Minxiao Wang, Ning Yang
{"title":"EmoAsst: emotion recognition assistant via text-guided transfer learning on pre-trained visual and acoustic models","authors":"Minxiao Wang, Ning Yang","doi":"10.3389/fcomp.2024.1304687","DOIUrl":null,"url":null,"abstract":"Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale high-quality human emotion data and the phenomenon of unbalanced performance on different modalities of data challenge DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models like Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text- based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, in addition to leveraging CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances the accuracy and offers great adaptability for real-world applications. The evaluation results on the MELD dataset highlight the outstanding performance of our methods, surpassing the majority of existing video and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video and audio-based Emotion Recognition and Classification (ERC).","PeriodicalId":52823,"journal":{"name":"Frontiers in Computer Science","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fcomp.2024.1304687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Children diagnosed with Autism Spectrum Disorder (ASD) often struggle to grasp social conventions and promptly recognize others' emotions. Recent advancements in the application of deep learning (DL) to emotion recognition are solidifying the role of AI-powered assistive technology in supporting autistic children. However, the cost of collecting and annotating large-scale, high-quality human emotion data, together with the phenomenon of unbalanced performance across different data modalities, challenges DL-based emotion recognition. In response to these challenges, this paper explores transfer learning, wherein large pre-trained models such as Contrastive Language-Image Pre-training (CLIP) and wav2vec 2.0 are fine-tuned to improve audio- and video-based emotion recognition with text-based guidance. In this work, we propose the EmoAsst framework, which includes a visual fusion module and emotion prompt fine-tuning for CLIP, and which leverages CLIP's text encoder and supervised contrastive learning for audio-based emotion recognition on the wav2vec 2.0 model. In addition, a joint few-shot emotion classifier enhances accuracy and offers great adaptability for real-world applications. Evaluation results on the MELD dataset highlight the strong performance of our methods, which surpass the majority of existing video- and audio-based approaches. Notably, our research demonstrates the promising potential of the proposed text-based guidance techniques for improving video- and audio-based Emotion Recognition and Classification (ERC).
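The text-guided audio pathway the abstract describes lends itself to a short illustration. Below is a minimal, hypothetical sketch (not the authors' released code): utterance embeddings from a pre-trained wav2vec 2.0 encoder are projected into the embedding space of CLIP's text encoder, where prompts naming the MELD emotion classes serve as classification anchors. The prompt template, projection head, pooling strategy, and temperature are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of text-guided audio emotion recognition, assuming
# Hugging Face transformers checkpoints for CLIP and wav2vec 2.0.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer, Wav2Vec2Model

# The seven emotion labels used by the MELD dataset.
EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Hypothetical emotion prompt template; the paper's exact prompts may differ.
prompts = [f"a voice expressing {e}" for e in EMOTIONS]
with torch.no_grad():
    tok = tokenizer(prompts, padding=True, return_tensors="pt")
    text_emb = text_encoder(**tok).pooler_output      # (7, 512) class anchors
    text_emb = F.normalize(text_emb, dim=-1)

# Trainable projection from wav2vec 2.0's 768-d space into CLIP's 512-d space.
proj = nn.Linear(768, 512)

def audio_logits(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (batch, samples) mono audio at 16 kHz."""
    with torch.no_grad():
        hidden = audio_encoder(waveform).last_hidden_state   # (B, T, 768)
    # Mean-pool over time, project, and L2-normalize.
    audio_emb = F.normalize(proj(hidden.mean(dim=1)), dim=-1)
    # Cosine similarity of each utterance to each emotion prompt.
    return audio_emb @ text_emb.T                            # (B, 7)

# Training would align audio embeddings with the matching prompt embedding;
# plain cross-entropy over the similarities is used here as a stand-in for
# the supervised contrastive objective the abstract mentions.
labels = torch.tensor([1])                       # e.g., "joy"
logits = audio_logits(torch.randn(1, 16000))     # 1 s of dummy audio
loss = F.cross_entropy(logits / 0.07, labels)    # 0.07: CLIP-style temperature
```

In the paper's fuller setup, supervised contrastive learning and a joint few-shot classifier operate on these text-aligned embeddings; the sketch only shows the alignment step.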
Source journal
Frontiers in Computer Science (Computer Science, Interdisciplinary Applications)
CiteScore: 4.30
Self-citation rate: 0.00%
Articles published per year: 152
Review time: 13 weeks