Facial Emotion Recognition of 16 Distinct Emotions From Smartphone Videos: Comparative Study of Machine Learning and Human Performance.

IF 5.8, Zone 2 (Medicine), Q1, HEALTH CARE SCIENCES & SERVICES

Journal of Medical Internet Research. Pub Date: 2025-07-02. DOI: 10.2196/68942

Marie Keinert, Simon Pistrosch, Adria Mallol-Ragolta, Björn W Schuller, Matthias Berking

Journal of Medical Internet Research, volume 27, e68942. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12268218/pdf/

Abstract

Background: The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions while neglecting other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.

Objective: This study aimed to develop deep learning-based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models' performance, and validate them by comparing the models with human observers.

Methods: The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded during a stress reduction training using front-facing smartphone cameras in a nonconstrained laboratory setting. Automatic FER models using both appearance and deep-learned features for binary and multiclass emotion classification were trained on the STREs WoZ dataset. The appearance features were based on the Facial Action Coding System and extracted with OpenFace. The deep-learned features were obtained through a ResNet50 model. For our deep learning models, we used the appearance features, the deep-learned features, and their concatenation as inputs. We used 3 recurrent neural network (RNN)-based architectures: RNN-convolution, RNN-attention, and RNN-average networks. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. A test set of 3018 facial emotion videos of the 16 emotions was completed by both the automatic FER model and human observers. The performance was assessed with unweighted average recall (UAR) and accuracy.
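The difference between the RNN-average and RNN-attention architectures named above can be illustrated with a small sketch of temporal pooling over per-frame vectors. The abstract does not specify the attention mechanism, so this is only a generic dot-product attention under stated assumptions: `frames` stands in for per-frame RNN outputs, and `query` is a hypothetical learned scoring vector. Neither name nor the toy values come from the study.

```python
import math

def average_pool(states):
    """Mean of the per-frame vectors: every frame gets the same weight."""
    n = len(states)
    dim = len(states[0])
    return [sum(s[d] for s in states) / n for d in range(dim)]

def attention_pool(states, query):
    """Softmax-weighted sum of the per-frame vectors.

    `query` is a hypothetical scoring vector (an assumption, not the
    paper's design); a frame's score is its dot product with `query`.
    """
    scores = [sum(a * b for a, b in zip(s, query)) for s in states]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # weights are positive and sum to 1
    dim = len(states[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
    return pooled, weights

# Toy per-frame vectors (invented for illustration).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
mean_vec = average_pool(frames)
att_vec, att_w = attention_pool(frames, query=[1.0, 0.0])
uniform_vec, uniform_w = attention_pool(frames, query=[0.0, 0.0])
```

With an all-zero `query` every frame scores the same, so attention pooling reduces to average pooling. A nonzero `query` lets the model emphasize the frames it scores highest, which may matter when an expression is only visible in part of a video.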

Results: Models using appearance features outperformed those using deep-learned features, as well as models combining both feature types in both tasks, with the attention network using appearance features emerging as the best-performing model. The attention network achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
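Because the results are reported as both UAR and accuracy, a short sketch of how the two metrics differ may help. UAR is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The labels and predictions below are invented toy data, not values from the study.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of all samples that were classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    hits = defaultdict(int)      # correctly predicted samples per true class
    support = defaultdict(int)   # total samples per true class
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)

# Imbalanced toy example: 8 negative and 2 positive samples.
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 8 + ["neg", "pos"]   # one positive sample is missed

acc = accuracy(y_true, y_pred)                   # 9 of 10 correct
uar = unweighted_average_recall(y_true, y_pred)  # mean of 8/8 and 1/2
```

Here accuracy is 0.9 while UAR is 0.75, because the single error falls on the smaller class. This is why UAR is often preferred when class sizes are unequal.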

Conclusions: Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.

Source journal: Journal of Medical Internet Research (Medicine, Health Care)

CiteScore: 14.40

Self-citation rate: 5.40%

Articles published: 654

Review time: 1 month

Journal description: The Journal of Medical Internet Research (JMIR), founded in 1999, publishes work in health informatics and health services, with a focus on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It ranks in the first quartile (Q1) by Impact Factor and is ranked #1 on Google Scholar within the "Medical Informatics" discipline.