Facial Action Units as a Joint Dataset Training Bridge for Facial Expression Recognition
Shuyi Mao; Xinpeng Li; Fan Zhang; Xiaojiang Peng; Yang Yang
IEEE Transactions on Multimedia, vol. 27, pp. 3331-3342, published 2025-01-27
DOI: 10.1109/TMM.2025.3535327 (https://ieeexplore.ieee.org/document/10855502/)
Citations: 0
Abstract
Label biases in facial expression recognition (FER) datasets, caused by annotators' subjectivity, make it difficult to improve performance on a target dataset when auxiliary labeled data are used. Moreover, training with multiple datasets can lead to visible performance degradation on the target dataset. To address these issues, we propose a novel framework called the AU-aware Vision Transformer (AU-ViT), which leverages unified action unit (AU) information and discards the expression annotations of auxiliary data. AU-ViT integrates an elaborately designed AU branch in the middle part of a master ViT to enhance representation learning during training. Through qualitative and quantitative analyses, we demonstrate that AU-ViT effectively captures expression regions and is robust to real-world occlusions. Additionally, we observe that AU-ViT also yields performance improvements on the target dataset, even without auxiliary data, by utilizing pseudo AU labels. Our AU-ViT achieves performance superior or comparable to that of state-of-the-art methods on FERPlus, RAF-DB, AffectNet, LSD, and three occlusion test datasets.
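To make the joint-training idea concrete, the sketch below (PyTorch) shows one possible way to attach an auxiliary AU branch to an intermediate block of a plain ViT and optimize it jointly with an expression head. The block depth, the tap index `au_tap`, the head designs, and the loss weighting `lam` are illustrative assumptions, not the paper's exact AU-ViT design.

```python
# Hypothetical sketch of an "AU branch in the middle of a master ViT" with
# joint expression/AU training. All module names and hyperparameters here
# are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn


class SimpleViTBlock(nn.Module):
    """One standard pre-norm Transformer encoder block."""
    def __init__(self, dim=256, heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class AUAwareViT(nn.Module):
    """Master ViT with an auxiliary AU branch tapped at an intermediate block."""
    def __init__(self, num_expr=7, num_aus=12, dim=256, depth=8, au_tap=4,
                 img_size=112, patch=16):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([SimpleViTBlock(dim) for _ in range(depth)])
        self.au_tap = au_tap                       # block index feeding the AU branch
        self.au_head = nn.Linear(dim, num_aus)     # multi-label AU logits
        self.expr_head = nn.Linear(dim, num_expr)  # expression logits

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)          # B x N x D
        x = torch.cat([self.cls_token.expand(x.size(0), -1, -1), x], dim=1)
        x = x + self.pos_embed
        au_logits = None
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i == self.au_tap:                   # AU branch in the middle of the ViT
                au_logits = self.au_head(x[:, 1:].mean(dim=1))
        expr_logits = self.expr_head(x[:, 0])      # class token -> expression
        return expr_logits, au_logits


def joint_loss(expr_logits, au_logits, expr_label, au_label, has_expr, lam=1.0):
    """Expression CE only on samples with trusted expression labels (target data);
    AU BCE on all samples, using annotated or pseudo AU labels."""
    if has_expr.any():
        ce = nn.functional.cross_entropy(expr_logits[has_expr], expr_label[has_expr])
    else:
        ce = expr_logits.sum() * 0.0               # keep the graph when batch has no target samples
    bce = nn.functional.binary_cross_entropy_with_logits(au_logits, au_label)
    return ce + lam * bce
```

In this sketch the auxiliary data contribute gradients only through the AU head, which mirrors the abstract's statement that auxiliary expression annotations are discarded while unified AU information is kept.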
Journal Description
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.