DMAGaze: Gaze estimation using feature disentanglement and multi-scale attention

Impact Factor: 3.3 | CAS Tier 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence)
Pattern Recognition Letters | Publication date: 2026-03-01 (Epub: 2026-01-13) | DOI: 10.1016/j.patrec.2026.01.013
Haohan Chen, Hongjia Liu, Shiyong Lan, Wenwu Wang, Yixin Qiao, Yao Li, Guonan Deng
{"title":"DMAGaze : Gaze estimation using feature disentanglement and multi-scale attention","authors":"Haohan Chen ,&nbsp;Hongjia Liu ,&nbsp;Shiyong Lan ,&nbsp;Wenwu Wang ,&nbsp;Yixin Qiao ,&nbsp;Yao Li ,&nbsp;Guonan Deng","doi":"10.1016/j.patrec.2026.01.013","DOIUrl":null,"url":null,"abstract":"<div><div>Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images—a key bottleneck limiting its accuracy in real-world scenarios. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects: gaze-relevant global features (disentangled from facial image), local eye features (extracted from cropped eye patch), and head pose related features, to improve overall performance. Firstly, we design a new continuous mask-based Disentangler to separate gaze-relevant and gaze-irrelevant information in facial images through reconstructing the eye and non-eye regions using a dual-branch architecture. Furthermore, we introduce a new attention module, called Multi-Scale Global Local Attention Module (MS-GLAM), to fuse the global and local information at multiple scales via a customized attention structure, thereby further enhancing the information from the Disentangler. Finally, we combine the global gaze-relevant features, with head pose and local eye features, and pass them through the detection head for high-precision gaze estimation. Our proposed DMAGaze has been evaluated extensively on two widely used public datasets: obtaining a gaze estimation error of 3.74° on MPIIFaceGaze and 6.17° on RT-GENE, outperforming SOTA methods.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"201 ","pages":"Pages 109-116"},"PeriodicalIF":3.3000,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865526000218","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/1/13 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Gaze estimation, which predicts gaze direction, commonly faces interference from complex gaze-irrelevant information in face images, a key bottleneck that limits its accuracy in real-world scenarios. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits information from facial images in three aspects to improve overall performance: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from cropped eye patches), and head-pose-related features. First, we design a new continuous-mask-based Disentangler that separates gaze-relevant and gaze-irrelevant information in facial images by reconstructing the eye and non-eye regions with a dual-branch architecture. Second, we introduce a new attention module, the Multi-Scale Global Local Attention Module (MS-GLAM), which fuses global and local information at multiple scales via a customized attention structure, further enhancing the information from the Disentangler. Finally, we combine the global gaze-relevant features with the head pose and local eye features and pass them through the detection head for high-precision gaze estimation. DMAGaze has been evaluated extensively on two widely used public datasets, achieving gaze estimation errors of 3.74° on MPIIFaceGaze and 6.17° on RT-GENE and outperforming state-of-the-art (SOTA) methods.
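The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates one plausible shape for the three components it names (the continuous-mask Disentangler, MS-GLAM fusion, and the combined detection head); every class name, layer choice, and dimension below is an assumption made for illustration, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the DMAGaze pipeline described above.
# Module names, layer choices, and dimensions are assumptions made for
# illustration; the paper's actual architecture is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Disentangler(nn.Module):
    """Dual-branch disentangler (assumed form): a shared encoder produces
    features, a continuous (soft) mask splits them into gaze-relevant and
    gaze-irrelevant streams, and two decoders reconstruct the eye and
    non-eye regions, respectively."""

    def __init__(self, ch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Sigmoid keeps the mask continuous in [0, 1] rather than binary.
        self.mask_head = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.eye_decoder = self._decoder(ch)
        self.rest_decoder = self._decoder(ch)

    @staticmethod
    def _decoder(ch: int) -> nn.Sequential:
        return nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, face):
        feat = self.encoder(face)
        m = self.mask_head(feat)                        # continuous mask
        gaze_feat = feat * m                            # gaze-relevant stream
        eye_recon = self.eye_decoder(gaze_feat)         # reconstruct eye region
        rest_recon = self.rest_decoder(feat * (1 - m))  # reconstruct non-eye region
        return gaze_feat, eye_recon, rest_recon


class MSGLAM(nn.Module):
    """Multi-Scale Global Local Attention Module (assumed form): pools the
    global face features at several scales and lets the local eye tokens
    attend to each pooled scale before fusing the results."""

    def __init__(self, dim: int = 64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, global_feat, local_tokens):
        # global_feat: (B, C, H, W); local_tokens: (B, N, C)
        fused = []
        for s in self.scales:
            g = F.adaptive_avg_pool2d(global_feat, s)   # (B, C, s, s)
            g = g.flatten(2).transpose(1, 2)            # (B, s*s, C)
            out, _ = self.attn(local_tokens, g, g)      # local queries global
            fused.append(out.mean(dim=1))               # (B, C)
        return self.fuse(torch.cat(fused, dim=-1))      # (B, C)


class DMAGaze(nn.Module):
    """End-to-end sketch: face -> disentangled global features; eye patch ->
    local tokens; MS-GLAM fusion; a head that also takes head pose."""

    def __init__(self, dim: int = 64, pose_dim: int = 3):
        super().__init__()
        self.disentangler = Disentangler(dim)
        self.eye_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.ms_glam = MSGLAM(dim)
        self.head = nn.Sequential(
            nn.Linear(dim + pose_dim, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, face, eyes, head_pose):
        gaze_feat, eye_recon, rest_recon = self.disentangler(face)
        tokens = self.eye_encoder(eyes).flatten(2).transpose(1, 2)
        fused = self.ms_glam(gaze_feat, tokens)
        gaze = self.head(torch.cat([fused, head_pose], dim=-1))  # (yaw, pitch)
        return gaze, eye_recon, rest_recon


model = DMAGaze()
gaze, _, _ = model(torch.randn(2, 3, 64, 64),   # face crop
                   torch.randn(2, 3, 32, 64),   # eye patch
                   torch.randn(2, 3))           # head pose vector
print(gaze.shape)  # torch.Size([2, 2])
```

In this reading, the soft mask routes features into two reconstruction branches so that the gaze-relevant stream is forced to explain the eye region; training would presumably combine reconstruction losses on the two decoder outputs with an angular loss on the gaze prediction, though the abstract does not spell out the objective.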
Source journal: Pattern Recognition Letters (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Annual articles: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, as well as other developing themes involving learning and recognition.