{"title":"标签引导的动态时空融合技术用于基于视频的面部表情识别","authors":"Ziyang Zhang;Xiang Tian;Yuan Zhang;Kailing Guo;Xiangmin Xu","doi":"10.1109/TMM.2024.3407693","DOIUrl":null,"url":null,"abstract":"Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features cascadingly has been proposed as an alternative approach However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame label is highly related to video label, it is overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF) that adopts frame labels to enhance the discriminative ability of spatial features and guide temporal fusion. By assigning each frame a video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and label distribution of spatial features is utilized to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. Our LG-DSTF achieves state-of-the-art results on FER benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10503-10513"},"PeriodicalIF":8.4000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition\",\"authors\":\"Ziyang Zhang;Xiang Tian;Yuan Zhang;Kailing Guo;Xiangmin Xu\",\"doi\":\"10.1109/TMM.2024.3407693\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features cascadingly has been proposed as an alternative approach However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame label is highly related to video label, it is overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF) that adopts frame labels to enhance the discriminative ability of spatial features and guide temporal fusion. By assigning each frame a video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and label distribution of spatial features is utilized to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. 
Our LG-DSTF achieves state-of-the-art results on FER benchmarks.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"26 \",\"pages\":\"10503-10513\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10552397/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10552397/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach, but it may not always yield optimal results because spatial and temporal information are distinct in nature. Extracting spatial and temporal features in a cascaded manner has been proposed as an alternative. However, video-based FER sometimes falls short of image-based FER, indicating that existing spatial-temporal fusion strategies underutilize the spatial information of each frame and model frame relations suboptimally. Although frame labels are highly related to the video label, they are overlooked by previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF), which adopts frame labels to enhance the discriminative ability of spatial features and to guide temporal fusion. By assigning the video label to each frame, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and the label distribution of spatial features measures the classification confidence of each frame. These confidence values serve as dynamic weights that emphasize crucial frames during temporal fusion of spatial features. LG-DSTF achieves state-of-the-art results on FER benchmarks.
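
To make the fusion mechanism concrete, the sketch below illustrates how per-frame classification confidence could drive the weighting described in the abstract. It is a minimal illustration based only on the abstract, not the authors' implementation: the tensor shapes, the softmax over frame logits, and the sum-to-one weight normalization are all assumptions.

    import torch
    import torch.nn.functional as F

    def confidence_weighted_fusion(frame_features, frame_logits):
        """Fuse per-frame spatial features into a single video-level feature.

        frame_features: (T, D) tensor, spatial feature of each of T frames.
        frame_logits:   (T, K) tensor, per-frame classification logits over K classes.
        """
        T, K = frame_logits.shape
        probs = F.softmax(frame_logits, dim=1)        # label distribution of each frame
        uniform = torch.full_like(probs, 1.0 / K)     # uniform distribution u
        # Cross entropy H(u, p) = -sum_k u_k * log(p_k). A peaked (confident)
        # label distribution gives a larger value than a near-uniform one,
        # whose cross entropy with u bottoms out at log(K).
        confidence = -(uniform * probs.clamp_min(1e-8).log()).sum(dim=1)    # (T,)
        weights = confidence / confidence.sum()       # assumed normalization to sum to 1
        video_feature = (weights.unsqueeze(1) * frame_features).sum(dim=0)  # (D,)
        return video_feature, weights

    # Example: 16 frames, 512-dim spatial features, 7 expression classes.
    feats = torch.randn(16, 512)
    logits = torch.randn(16, 7)
    video_feat, w = confidence_weighted_fusion(feats, logits)

Under this reading, a frame whose predicted label distribution is far from uniform contributes more to the fused video feature; the exact weighting scheme used in the paper may differ.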
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.