ME-FAS: Multimodal Text Enhancement for Cross-Domain Face Anti-Spoofing
Lvpan Cai; Haowei Wang; Jiayi Ji; Xiaoshuai Sun; Liujuan Cao; Rongrong Ji
IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5451-5464. DOI: 10.1109/TIFS.2025.3571660. Published: 2025-03-19.
Abstract
The focus of Face Anti-Spoofing (FAS) is shifting toward improving generalization performance in unseen scenarios. Traditional methods employing adversarial learning and meta-learning aim to extract or decouple generalizable features to address these challenges. However, enhancing performance solely through facial features remains challenging without additional informative inputs. To address this, Vision-Language Models (VLMs) with robust generalization capabilities have recently been introduced to FAS. Despite their potential, these VLMs typically adopt a late alignment strategy, relying only on encoder output features for modality alignment, which largely neglects mutual guidance between modalities. To bridge this gap, inspired by recent advancements in prompt learning, we employ learnable prompts and masking as intermediaries to enhance interaction between text and visual modalities, enabling the extraction of more generalizable features. Specifically, we propose ME-FAS, a Modality-Enhanced cross-domain FAS model integrating Prompt Fusion Transfer (PFT) and Text-guided Image Masking (TIM). PFT facilitates the integration of text features with visual information, improving domain adaptability in alignment with the textual context. Meanwhile, TIM leverages text features to mask image patches, directing visual features toward critical generalizable facial information, such as the eyes and mouth. Comprehensive evaluations across multiple benchmarks and various visualizations demonstrate significant performance gains, validating the effectiveness of our proposed approach. Our code and models are available at https://github.com/clpbc/ME-FAS.
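To make the text-guided masking idea concrete, below is a minimal sketch of how text features might select which image patches to retain. The function name, keep ratio, and CLIP-style feature shapes are illustrative assumptions, not the paper's actual implementation; consult the linked repository for the authors' code.

```python
# A minimal sketch of text-guided image masking, assuming CLIP-style
# encoders. Shapes, names, and the keep_ratio default are hypothetical.
import torch
import torch.nn.functional as F

def text_guided_image_masking(patch_feats: torch.Tensor,
                              text_feat: torch.Tensor,
                              keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the image patches least similar to the text feature.

    patch_feats: (B, N, D) patch embeddings from a vision encoder.
    text_feat:   (B, D) sentence embedding from a text encoder.
    keep_ratio:  fraction of patches retained (assumed value).
    """
    # Cosine similarity between each patch and the text embedding: (B, N).
    sim = F.cosine_similarity(patch_feats, text_feat.unsqueeze(1), dim=-1)
    n_keep = max(1, int(patch_feats.size(1) * keep_ratio))
    # Indices of the patches most aligned with the text: (B, n_keep).
    top_idx = sim.topk(n_keep, dim=1).indices
    # Binary mask with ones at the retained patch positions.
    mask = torch.zeros_like(sim).scatter_(1, top_idx, 1.0)
    return patch_feats * mask.unsqueeze(-1)  # masked patches are zeroed

# Example: 2 images, 196 patches of dimension 512, one text prompt each.
patches = torch.randn(2, 196, 512)
text = torch.randn(2, 512)
masked = text_guided_image_masking(patches, text)
```

Keeping only the patches most similar to the text embedding steers the visual representation toward text-relevant facial regions, in the spirit of the eyes-and-mouth example given in the abstract.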
About the Journal
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.