ME-FAS: Multimodal Text Enhancement for Cross-Domain Face Anti-Spoofing
Lvpan Cai; Haowei Wang; Jiayi Ji; Xiaoshuai Sun; Liujuan Cao; Rongrong Ji
IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5451-5464. DOI: 10.1109/TIFS.2025.3571660. Published: 2025-03-19.
Abstract
The focus of Face Anti-Spoofing (FAS) is shifting toward improving generalization performance in unseen scenarios. Traditional methods employing adversarial learning and meta-learning aim to extract or decouple generalizable features to address these challenges. However, enhancing performance solely through facial features remains challenging without additional informative inputs. To address this, Vision-Language Models (VLMs) with robust generalization capabilities have recently been introduced to FAS. Despite their potential, these VLMs typically adopt a late alignment strategy, relying only on encoder output features for modality alignment, which largely neglects mutual guidance between modalities. To bridge this gap, inspired by recent advancements in prompt learning, we employ learnable prompts and masking as intermediaries to enhance interaction between text and visual modalities, enabling the extraction of more generalizable features. Specifically, we propose ME-FAS, a Modality-Enhanced cross-domain FAS model integrating Prompt Fusion Transfer (PFT) and Text-guided Image Masking (TIM). PFT facilitates the integration of text features with visual information, improving domain adaptability in alignment with the textual context. Meanwhile, TIM leverages text features to mask image patches, directing visual features toward critical generalizable facial information, such as the eyes and mouth. Comprehensive evaluations across multiple benchmarks and various visualizations demonstrate significant performance gains, validating the effectiveness of our proposed approach. Our code and models are available at https://github.com/clpbc/ME-FAS.
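To make the text-guided masking idea concrete, below is a minimal sketch of how text features might select which image patches to retain. The function name, keep ratio, and CLIP-style feature shapes are illustrative assumptions, not the paper's actual implementation; consult the linked repository for the authors' code.

```python
# A minimal sketch of text-guided image masking, assuming CLIP-style
# encoders. Shapes, names, and the keep_ratio default are hypothetical.
import torch
import torch.nn.functional as F

def text_guided_image_masking(patch_feats: torch.Tensor,
                              text_feat: torch.Tensor,
                              keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the image patches least similar to the text feature.

    patch_feats: (B, N, D) patch embeddings from a vision encoder.
    text_feat:   (B, D) sentence embedding from a text encoder.
    keep_ratio:  fraction of patches retained (assumed value).
    """
    # Cosine similarity between each patch and the text embedding: (B, N).
    sim = F.cosine_similarity(patch_feats, text_feat.unsqueeze(1), dim=-1)
    n_keep = max(1, int(patch_feats.size(1) * keep_ratio))
    # Indices of the patches most aligned with the text: (B, n_keep).
    top_idx = sim.topk(n_keep, dim=1).indices
    # Binary mask with ones at the retained patch positions.
    mask = torch.zeros_like(sim).scatter_(1, top_idx, 1.0)
    return patch_feats * mask.unsqueeze(-1)  # masked patches are zeroed

# Example: 2 images, 196 patches of dimension 512, one text prompt each.
patches = torch.randn(2, 196, 512)
text = torch.randn(2, 512)
masked = text_guided_image_masking(patches, text)
```

Keeping only the patches most similar to the text embedding steers the visual representation toward text-relevant facial regions, in the spirit of the eyes-and-mouth example given in the abstract.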
About the Journal
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.