{"title":"集成端到端多模态深度学习和领域自适应的鲁棒面部表情识别","authors":"Mahmoud Hassaballah , Chiara Pero , Ranjeet Kumar Rout , Saiyed Umer","doi":"10.1016/j.imavis.2025.105548","DOIUrl":null,"url":null,"abstract":"<div><div>This paper presents an advanced approach to a facial expression recognition (FER) system designed for robust performance across diverse imaging environments. The proposed method consists of four primary components: image preprocessing, feature representation and classification, cross-domain feature analysis, and domain adaptation. The process begins with facial region extraction from input images, including those captured in unconstrained imaging conditions, where variations in lighting, background, and image quality significantly impact recognition performance. The extracted facial region undergoes feature extraction using an ensemble of multimodal deep learning techniques, including end-to-end CNNs, BilinearCNN, TrilinearCNN, and pretrained CNN models, which capture both local and global facial features with high precision. The ensemble approach enriches feature representation by integrating information from multiple models, enhancing the system’s ability to generalize across different subjects and expressions. These deep features are then passed to a classifier trained to recognize facial expressions effectively in real-time scenarios. Since images captured in real-world conditions often contain noise and artifacts that can compromise accuracy, cross-domain analysis is performed to evaluate the discriminative power and robustness of the extracted deep features. FER systems typically experience performance degradation when applied to domains that differ from the original training environment. To mitigate this issue, domain adaptation techniques are incorporated, enabling the system to effectively adjust to new imaging conditions and improving recognition accuracy even in challenging real-time acquisition environments. The proposed FER system is validated using four well-established benchmark datasets: CK+, KDEF, IMFDB and AffectNet. Experimental results demonstrate that the proposed system achieves high performance within original domains and exhibits superior cross-domain recognition compared to existing state-of-the-art methods. These findings indicate that the system is highly reliable for applications requiring robust and adaptive FER capabilities across varying imaging conditions and domains.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105548"},"PeriodicalIF":4.2000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Integrating end-to-end multimodal deep learning and domain adaptation for robust facial expression recognition\",\"authors\":\"Mahmoud Hassaballah , Chiara Pero , Ranjeet Kumar Rout , Saiyed Umer\",\"doi\":\"10.1016/j.imavis.2025.105548\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>This paper presents an advanced approach to a facial expression recognition (FER) system designed for robust performance across diverse imaging environments. The proposed method consists of four primary components: image preprocessing, feature representation and classification, cross-domain feature analysis, and domain adaptation. 
The process begins with facial region extraction from input images, including those captured in unconstrained imaging conditions, where variations in lighting, background, and image quality significantly impact recognition performance. The extracted facial region undergoes feature extraction using an ensemble of multimodal deep learning techniques, including end-to-end CNNs, BilinearCNN, TrilinearCNN, and pretrained CNN models, which capture both local and global facial features with high precision. The ensemble approach enriches feature representation by integrating information from multiple models, enhancing the system’s ability to generalize across different subjects and expressions. These deep features are then passed to a classifier trained to recognize facial expressions effectively in real-time scenarios. Since images captured in real-world conditions often contain noise and artifacts that can compromise accuracy, cross-domain analysis is performed to evaluate the discriminative power and robustness of the extracted deep features. FER systems typically experience performance degradation when applied to domains that differ from the original training environment. To mitigate this issue, domain adaptation techniques are incorporated, enabling the system to effectively adjust to new imaging conditions and improving recognition accuracy even in challenging real-time acquisition environments. The proposed FER system is validated using four well-established benchmark datasets: CK+, KDEF, IMFDB and AffectNet. Experimental results demonstrate that the proposed system achieves high performance within original domains and exhibits superior cross-domain recognition compared to existing state-of-the-art methods. These findings indicate that the system is highly reliable for applications requiring robust and adaptive FER capabilities across varying imaging conditions and domains.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"159 \",\"pages\":\"Article 105548\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625001362\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625001362","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Integrating end-to-end multimodal deep learning and domain adaptation for robust facial expression recognition
Abstract:
This paper presents a facial expression recognition (FER) system designed for robust performance across diverse imaging environments. The proposed method consists of four primary components: image preprocessing, feature representation and classification, cross-domain feature analysis, and domain adaptation. The process begins with extracting the facial region from each input image, including images captured under unconstrained conditions, where variations in lighting, background, and image quality significantly degrade recognition performance. Features are then extracted from the facial region by an ensemble of multimodal deep learning models, including end-to-end CNNs, BilinearCNN, TrilinearCNN, and pretrained CNN models, which capture both local and global facial features with high precision. The ensemble approach enriches the feature representation by integrating information from multiple models, improving the system's ability to generalize across subjects and expressions. These deep features are then passed to a classifier trained to recognize facial expressions effectively in real-time scenarios. Because images captured in real-world conditions often contain noise and artifacts that compromise accuracy, a cross-domain analysis is performed to evaluate the discriminative power and robustness of the extracted deep features. FER systems typically suffer performance degradation when applied to domains that differ from the original training environment; to mitigate this, domain adaptation techniques are incorporated, enabling the system to adjust effectively to new imaging conditions and improving recognition accuracy even in challenging real-time acquisition environments. The proposed FER system is validated on four well-established benchmark datasets: CK+, KDEF, IMFDB, and AffectNet. Experimental results demonstrate that the system achieves high performance within the original domains and exhibits superior cross-domain recognition compared with existing state-of-the-art methods. These findings indicate that the system is well suited to applications requiring robust, adaptive FER across varying imaging conditions and domains.
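The abstract's preprocessing stage (facial region extraction under unconstrained conditions) can be illustrated with a short detect-and-crop routine. The sketch below uses OpenCV's Haar cascade purely as a stand-in; the abstract does not say which face detector the authors actually use, and the function name and crop size are illustrative assumptions.

```python
# Hedged sketch of the preprocessing step: detect and crop the facial
# region before feature extraction. OpenCV's Haar cascade is a stand-in;
# the paper's actual detector is not specified in the abstract.
import cv2

def extract_face(image_path: str, size: int = 224):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face detected
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])  # keep largest face
    return cv2.resize(img[y:y + h, x:x + w], (size, size))
```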
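The abstract names BilinearCNN among the ensemble's feature extractors but gives no implementation detail. For reference, the following is a minimal PyTorch sketch of standard bilinear pooling (in the style of the original B-CNN construction), not the authors' architecture: the VGG16 backbone, 224-pixel input, and seven-class head are assumptions made for illustration.

```python
# Minimal sketch of bilinear pooling over CNN feature maps (PyTorch).
# Assumptions: a VGG16 convolutional trunk and a 7-class expression head;
# the paper's actual backbones and dimensions are not given in the abstract.
import torch
import torch.nn as nn
from torchvision import models

class BilinearCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = models.vgg16(weights=None).features  # (B, 512, H', W')
        self.fc = nn.Linear(512 * 512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)                     # (B, 512, N)
        # Outer product of the feature map with itself, averaged over
        # spatial locations: captures pairwise channel interactions.
        z = torch.bmm(f, f.transpose(1, 2)) / (h * w)  # (B, 512, 512)
        z = z.reshape(b, -1)
        # Signed square root + L2 normalisation, as in standard B-CNNs.
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-10)
        z = nn.functional.normalize(z)
        return self.fc(z)

model = BilinearCNN(num_classes=7)            # e.g. seven basic expressions
logits = model(torch.randn(2, 3, 224, 224))   # dummy batch
print(logits.shape)                           # torch.Size([2, 7])
```

A TrilinearCNN extends the same idea to third-order channel interactions; an ensemble such as the one described would concatenate or fuse these pooled features with those of end-to-end and pretrained CNNs before classification.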
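Likewise, "domain adaptation techniques" is left generic in the abstract. One common realisation penalises the discrepancy between source- and target-domain feature distributions with the maximum mean discrepancy (MMD); the sketch below illustrates that idea under stated assumptions (an RBF kernel, bandwidth `sigma`, and trade-off weight `lambda_da` are all illustrative) and is not claimed to be the paper's method.

```python
# Hedged sketch: domain adaptation via an RBF-kernel MMD penalty between
# source and target deep features (PyTorch). One standard technique; the
# abstract does not specify which one the authors use.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise squared Euclidean distances mapped through a Gaussian kernel.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(src: torch.Tensor, tgt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Simple (biased) batch estimate of squared MMD between two feature sets.
    return (rbf_kernel(src, src, sigma).mean()
            + rbf_kernel(tgt, tgt, sigma).mean()
            - 2 * rbf_kernel(src, tgt, sigma).mean())

# Usage: add the penalty to the supervised loss on source-domain labels.
src_feats = torch.randn(32, 256)   # features from the labelled source domain
tgt_feats = torch.randn(32, 256)   # features from the unlabelled target domain
penalty = mmd_loss(src_feats, tgt_feats)
# total_loss = ce_loss + lambda_da * penalty   # lambda_da: trade-off weight
```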
Journal introduction:
Image and Vision Computing aims primarily to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding of the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.