Computer Vision and Image Understanding最新文献

筛选
英文 中文
Invisible backdoor attack with attention and steganography 利用注意力和隐写术进行隐形后门攻击
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-19 DOI: 10.1016/j.cviu.2024.104208
Wenmin Chen, Xiaowei Xu, Xiaodong Wang, Huasong Zhou, Zewen Li, Yangming Chen
{"title":"Invisible backdoor attack with attention and steganography","authors":"Wenmin Chen,&nbsp;Xiaowei Xu,&nbsp;Xiaodong Wang,&nbsp;Huasong Zhou,&nbsp;Zewen Li,&nbsp;Yangming Chen","doi":"10.1016/j.cviu.2024.104208","DOIUrl":"10.1016/j.cviu.2024.104208","url":null,"abstract":"<div><div>Recently, with the development and widespread application of deep neural networks (DNNs), backdoor attacks have posed new security threats to the training process of DNNs. Backdoor attacks on neural networks undermine the security and trustworthiness of DNNs by implanting hidden, unauthorized triggers, leading to benign behavior on clean samples while exhibiting malicious behavior on samples containing backdoor triggers. Existing backdoor attacks typically employ triggers that are sample-agnostic and identical for each sample, resulting in poisoned images that lack naturalness and are ineffective against existing backdoor defenses. To address these issues, this paper proposes a novel stealthy backdoor attack, where the backdoor trigger is dynamic and specific to each sample. Specifically, we leverage spatial attention on images and pre-trained models to obtain dynamic triggers, which are then injected using an encoder–decoder network. The design of the injection network benefits from recent advances in steganography research. To demonstrate the effectiveness of the proposed steganographic network, we design two backdoor attack modes named ASBA and ATBA, where ASBA utilizes the steganographic network for attack, while ATBA is a backdoor attack without steganography. Subsequently, we conducted attacks on Deep Neural Networks (DNNs) using four standard datasets. Our extensive experiments show that ASBA surpasses ATBA in terms of stealthiness and resilience against current defensive measures. Furthermore, both ASBA and ATBA demonstrate superior attack efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104208"},"PeriodicalIF":4.3,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142528462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
NeRFtrinsic Four: An end-to-end trainable NeRF jointly optimizing diverse intrinsic and extrinsic camera parameters NeRFtrinsic Four:端到端可训练 NeRF,联合优化各种内在和外在相机参数
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-18 DOI: 10.1016/j.cviu.2024.104206
Hannah Schieber , Fabian Deuser , Bernhard Egger , Norbert Oswald , Daniel Roth
{"title":"NeRFtrinsic Four: An end-to-end trainable NeRF jointly optimizing diverse intrinsic and extrinsic camera parameters","authors":"Hannah Schieber ,&nbsp;Fabian Deuser ,&nbsp;Bernhard Egger ,&nbsp;Norbert Oswald ,&nbsp;Daniel Roth","doi":"10.1016/j.cviu.2024.104206","DOIUrl":"10.1016/j.cviu.2024.104206","url":null,"abstract":"<div><div>Novel view synthesis using neural radiance fields (NeRF) is the state-of-the-art technique for generating high-quality images from novel viewpoints. Existing methods require a priori knowledge about extrinsic and intrinsic camera parameters. This limits their applicability to synthetic scenes, or real-world scenarios with the necessity of a preprocessing step. Current research on the joint optimization of camera parameters and NeRF focuses on refining noisy extrinsic camera parameters and often relies on the preprocessing of intrinsic camera parameters. Further approaches are limited to cover only one single camera intrinsic. To address these limitations, we propose a novel end-to-end trainable approach called NeRFtrinsic Four. We utilize Gaussian Fourier features to estimate extrinsic camera parameters and dynamically predict varying intrinsic camera parameters through the supervision of the projection error. Our approach outperforms existing joint optimization methods on LLFF and BLEFF. In addition to these existing datasets, we introduce a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic Four is a step forward in joint optimization NeRF-based view synthesis and enables more realistic and flexible rendering in real-world scenarios with varying camera parameters.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104206"},"PeriodicalIF":4.3,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142528464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Lightweight cross-modal transformer for RGB-D salient object detection 用于 RGB-D 突出物体检测的轻量级跨模态变换器
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-17 DOI: 10.1016/j.cviu.2024.104194
Nianchang Huang , Yang Yang , Qiang Zhang , Jungong Han , Jin Huang
{"title":"Lightweight cross-modal transformer for RGB-D salient object detection","authors":"Nianchang Huang ,&nbsp;Yang Yang ,&nbsp;Qiang Zhang ,&nbsp;Jungong Han ,&nbsp;Jin Huang","doi":"10.1016/j.cviu.2024.104194","DOIUrl":"10.1016/j.cviu.2024.104194","url":null,"abstract":"<div><div>Recently, Transformer-based RGB-D salient object detection (SOD) models have pushed the performance to a new level. However, they come at the cost of consuming abundant resources, including memory and power, thus hindering their real-life applications. To remedy this situation, a novel lightweight cross-modal Transformer (LCT) for RGB-D SOD will be presented in this paper. Specifically, LCT will first reduce its parameters and computational costs by employing a middle-level feature fusion structure and taking a lightweight Transformer as the backbone. Then, with the aid of Transformers, it will compensate for performance degradation by effectively capturing the cross-modal and cross-level complementary information from the multi-modal input images. To this end, a cross-modal enhancement and fusion module (CEFM) with a lightweight channel-wise cross attention block (LCCAB) will be designed to capture the cross-modal complementary information effectively but with fewer costs. A bi-directional multi-level feature interaction module (Bi-MFIM) with a lightweight spatial-wise cross attention block (LSCAB) will be designed to capture the cross-level complementary context information. By virtue of CEFM and Bi-MFIM, the performance degradation caused by parameter reduction can be well compensated, thus boosting the performances. By doing so, our proposed model has only 2.8M parameters with 7.6G FLOPs and runs at 66 FPS. Furthermore, experimental results on several benchmark datasets show that our proposed model can achieve competitive or even better results than other models. Our code will be released on <span><span>https://github.com/nexiakele/lightweight-cross-modal-Transformer-LCT-for-RGB-D-SOD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104194"},"PeriodicalIF":4.3,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142528465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning PMGNet:互不纠缠和纠缠互利,促进合成零点学习
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-16 DOI: 10.1016/j.cviu.2024.104197
Yu Liu , Jianghao Li , Yanyi Zhang , Qi Jia , Weimin Wang , Nan Pu , Nicu Sebe
{"title":"PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning","authors":"Yu Liu ,&nbsp;Jianghao Li ,&nbsp;Yanyi Zhang ,&nbsp;Qi Jia ,&nbsp;Weimin Wang ,&nbsp;Nan Pu ,&nbsp;Nicu Sebe","doi":"10.1016/j.cviu.2024.104197","DOIUrl":"10.1016/j.cviu.2024.104197","url":null,"abstract":"<div><div>Compositional zero-shot learning (CZSL) aims to model compositions of two primitives (i.e., attributes and objects) to classify unseen attribute-object pairs. Most studies are devoted to integrating disentanglement and entanglement strategies to circumvent the trade-off between contextuality and generalizability. Indeed, the two strategies can mutually benefit when used together. Nevertheless, they neglect the significance of developing mutual guidance between the two strategies. In this work, we take full advantage of guidance from disentanglement to entanglement and vice versa. Additionally, we propose exploring multi-scale feature learning to achieve fine-grained mutual guidance in a progressive framework. Our approach, termed Progressive Mutual Guidance Network (PMGNet), unifies disentanglement–entanglement representation learning, allowing them to learn from and teach each other progressively in one unified model. Furthermore, to alleviate overfitting recognition on seen pairs, we adopt a relaxed cross-entropy loss to train PMGNet, without an increase of time and memory cost. Extensive experiments on three benchmarks demonstrate that our method achieves distinct improvements, reaching state-of-the-art performance. Moreover, PMGNet exhibits promising performance under the most challenging open-world CZSL setting, especially for unseen pairs.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104197"},"PeriodicalIF":4.3,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
FTM: The Face Truth Machine—Hand-crafted features from micro-expressions to support lie detection FTM:面部真实机器--从微表情中手工创建特征,支持谎言检测
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-16 DOI: 10.1016/j.cviu.2024.104188
Maria De Marsico, Giordano Dionisi, Donato Francesco Pio Stanco
{"title":"FTM: The Face Truth Machine—Hand-crafted features from micro-expressions to support lie detection","authors":"Maria De Marsico,&nbsp;Giordano Dionisi,&nbsp;Donato Francesco Pio Stanco","doi":"10.1016/j.cviu.2024.104188","DOIUrl":"10.1016/j.cviu.2024.104188","url":null,"abstract":"<div><div>This work deals with the delicate task of lie detection from facial dynamics. The proposed Face Truth Machine (FTM) is an intelligent system able to support a human operator without any special equipment. It can be embedded in the present infrastructures for forensic investigation or whenever it is required to assess the trustworthiness of responses during an interview. Due to its flexibility and its non-invasiveness, it can overcome some limitations of present solutions. Of course, privacy issues may arise from the use of such systems, as often underlined nowadays. However, it is up to the utilizer to take these into account and make fair use of tools of this kind. The paper will discuss particular aspects of the dynamic analysis of face landmarks to detect lies. In particular, it will delve into the behavior of the features used for detection and how these influence the system’s final decision. The novel detection system underlying the Face Truth Machine is able to analyze the subject’s expressions in a wide range of poses. The results of the experiments presented testify to the potential of the proposed approach and also highlight the very good results obtained in cross-dataset testing, which usually represents a challenge for other approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104188"},"PeriodicalIF":4.3,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142528463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
M3A: A multimodal misinformation dataset for media authenticity analysis M3A:用于媒体真实性分析的多模态错误信息数据集
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-15 DOI: 10.1016/j.cviu.2024.104205
Qingzheng Xu , Huiqiang Chen , Heming Du , Hu Zhang , Szymon Łukasik , Tianqing Zhu , Xin Yu
{"title":"M3A: A multimodal misinformation dataset for media authenticity analysis","authors":"Qingzheng Xu ,&nbsp;Huiqiang Chen ,&nbsp;Heming Du ,&nbsp;Hu Zhang ,&nbsp;Szymon Łukasik ,&nbsp;Tianqing Zhu ,&nbsp;Xin Yu","doi":"10.1016/j.cviu.2024.104205","DOIUrl":"10.1016/j.cviu.2024.104205","url":null,"abstract":"<div><div>With the development of various generative models, misinformation in news media becomes more deceptive and easier to create, posing a significant problem. However, existing datasets for misinformation study often have limited modalities, constrained sources, and a narrow range of topics. These limitations make it difficult to train models that can effectively combat real-world misinformation. To address this, we propose a comprehensive, large-scale Multimodal Misinformation dataset for Media Authenticity Analysis (<span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span>), featuring broad sources and fine-grained annotations for topics and sentiments. To curate <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span>, we collect genuine news content from 60 renowned news outlets worldwide and generate fake samples using multiple techniques. These include altering named entities in texts, swapping modalities between samples, creating new modalities, and misrepresenting movie content as news. <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span> contains 708K genuine news samples and over 6M fake news samples, spanning text, images, audio, and video. <span><math><mrow><msup><mrow><mi>M</mi></mrow><mrow><mn>3</mn></mrow></msup><mi>A</mi></mrow></math></span> provides detailed multi-class labels, crucial for various misinformation detection tasks, including out-of-context detection and deepfake detection. For each task, we offer extensive benchmarks using state-of-the-art models, aiming to enhance the development of robust misinformation detection systems.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104205"},"PeriodicalIF":4.3,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142445584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Region-aware image-based human action retrieval with transformers 利用变换器进行基于区域感知图像的人体动作检索
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-14 DOI: 10.1016/j.cviu.2024.104202
Hongsong Wang , Jianhua Zhao , Jie Gui
{"title":"Region-aware image-based human action retrieval with transformers","authors":"Hongsong Wang ,&nbsp;Jianhua Zhao ,&nbsp;Jie Gui","doi":"10.1016/j.cviu.2024.104202","DOIUrl":"10.1016/j.cviu.2024.104202","url":null,"abstract":"<div><div>Human action understanding is a fundamental and challenging task in computer vision. Although there exists tremendous research on this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present a Transformer-based model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A fusion transformer is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on both the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method significantly outperforms previous approaches for image-based action retrieval.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104202"},"PeriodicalIF":4.3,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A simple but effective vision transformer framework for visible–infrared person re-identification 用于可见光-红外线人员再识别的简单而有效的视觉转换器框架
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-11 DOI: 10.1016/j.cviu.2024.104192
Yudong Li , Sanyuan Zhao , Jianbing Shen
{"title":"A simple but effective vision transformer framework for visible–infrared person re-identification","authors":"Yudong Li ,&nbsp;Sanyuan Zhao ,&nbsp;Jianbing Shen","doi":"10.1016/j.cviu.2024.104192","DOIUrl":"10.1016/j.cviu.2024.104192","url":null,"abstract":"<div><div>In the context of visible–infrared person re-identification (VI-ReID), the acquisition of a robust visual representation is paramount. Existing approaches predominantly rely on convolutional neural networks (CNNs), which are guided by intricately designed loss functions to extract features. In contrast, the vision transformer (ViT), a potent visual backbone, has often yielded subpar results in VI-ReID. We contend that the prevailing training methodologies and insights derived from CNNs do not seamlessly apply to ViT, leading to the underutilization of its potential in VI-ReID. One notable limitation is ViT’s appetite for extensive data, exemplified by the JFT-300M dataset, to surpass CNNs. Consequently, ViT struggles to transfer its knowledge from visible to infrared images due to inadequate training data. Even the largest available dataset, SYSU-MM01, proves insufficient for ViT to glean a robust representation of infrared images. This predicament is exacerbated when ViT is trained on the smaller RegDB dataset, where slight data flow modifications drastically affect performance—a stark contrast to CNN behavior. These observations lead us to conjecture that the CNN-inspired paradigm impedes ViT’s progress in VI-ReID. In light of these challenges, we undertake comprehensive ablation studies to shed new light on ViT’s applicability in VI-ReID. We propose a straightforward yet effective framework, named “Idformer”, to train a high-performing ViT for VI-ReID. Idformer serves as a robust baseline that can be further enhanced with carefully designed techniques akin to those used for CNNs. Remarkably, our method attains competitive results even in the absence of auxiliary information, achieving 78.58%/76.99% Rank-1/mAP on the SYSU-MM01 dataset, as well as 96.82%/91.83% Rank-1/mAP on the RegDB dataset. The code will be made publicly accessible.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104192"},"PeriodicalIF":4.3,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An end-to-end tracking framework via multi-view and temporal feature aggregation 通过多视角和时间特征聚合实现端到端跟踪框架
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-10 DOI: 10.1016/j.cviu.2024.104203
Yihan Yang , Ming Xu , Jason F. Ralph , Yuchen Ling , Xiaonan Pan
{"title":"An end-to-end tracking framework via multi-view and temporal feature aggregation","authors":"Yihan Yang ,&nbsp;Ming Xu ,&nbsp;Jason F. Ralph ,&nbsp;Yuchen Ling ,&nbsp;Xiaonan Pan","doi":"10.1016/j.cviu.2024.104203","DOIUrl":"10.1016/j.cviu.2024.104203","url":null,"abstract":"<div><div>Multi-view pedestrian tracking has frequently been used to cope with the challenges of occlusion and limited fields-of-view in single-view tracking. However, there are few end-to-end methods in this field. Many existing algorithms detect pedestrians in individual views, cluster projected detections in a top view and then track them. The others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking, in which both multi-view and temporal aggregations of feature maps are applied. The multi-view aggregation projects the per-view feature maps to a top view, uses a transformer encoder to output encoded feature maps and then uses a CNN to calculate a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps in consecutive frames. Our experiments have demonstrated that this end-to-end framework outperforms the state-of-the-art online algorithms for multi-view pedestrian tracking.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104203"},"PeriodicalIF":4.3,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A fast differential network with adaptive reference sample for gaze estimation 用于凝视估计的带有自适应参考样本的快速差分网络
IF 4.3 3区 计算机科学
Computer Vision and Image Understanding Pub Date : 2024-10-09 DOI: 10.1016/j.cviu.2024.104156
Jiahui Hu , Yonghua Lu , Xiyuan Ye , Qiang Feng , Lihua Zhou
{"title":"A fast differential network with adaptive reference sample for gaze estimation","authors":"Jiahui Hu ,&nbsp;Yonghua Lu ,&nbsp;Xiyuan Ye ,&nbsp;Qiang Feng ,&nbsp;Lihua Zhou","doi":"10.1016/j.cviu.2024.104156","DOIUrl":"10.1016/j.cviu.2024.104156","url":null,"abstract":"<div><div>Most non-invasive gaze estimation methods do not consider the inter-individual differences in anatomical structure, but directly regress the gaze direction from the appearance image information, which limits the accuracy of individual-independent gaze estimation networks. In addition, existing gaze estimation methods tend to consider only how to improve the model’s generalization performance, ignoring the crucial issue of efficiency, which leads to bulky models that are difficult to deploy and have questionable cost-effectiveness in practical use. This paper makes the following contributions: (1) A differential network for gaze estimation using adaptive reference samples is proposed, which can adaptively select reference samples based on scene and individual characteristics. (2) The knowledge distillation is used to transfer the knowledge structure of robust teacher networks into lightweight networks so that our networks can execute quickly and at low computational cost, dramatically increasing the prospect and value of applying gaze estimation. (3) Integrating the above innovations, a novel fast differential neural network (Diff-Net) named FDAR-Net is constructed and achieved excellent results on MPIIGaze, UTMultiview and EyeDiap.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104156"},"PeriodicalIF":4.3,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信