{"title":"ClipSwap++: Improved Identity and Attributes Aware Face Swapping","authors":"Phyo Thet Yee;Sudeepta Mishra;Abhinav Dhall","doi":"10.1109/TBIOM.2025.3576111","DOIUrl":null,"url":null,"abstract":"This paper introduces an efficient framework for an identity and attributes aware face swapping. Accurately preserving the source face’s identity while maintaining the target face’s attributes remains a challenge in face swapping due to mismatches between identity and attribute features. To address this, based on our previous work, ClipSwap, we propose an extended version, ClipSwap++, with improved model efficiency with respect to inference time, memory consumption, and more accurate preservation of identity and attributes. Our model is mainly composed of a conditional Generative Adversarial Network and a CLIP-based image encoder to generate realistic face-swapped images. We carefully design our ClipSwap++ with the combination of following three components. First, we introduce the Adaptive Identity Fusion Module (AIFM), which ensures accurate preservation of identity through the careful integration of ArcFace-encoded identity with CLIP-embedded identity. Second, we propose a new decoder architecture with multiple Multi-level Attributes Integration Module (MAIM) to adaptively integrate identity and attribute features, enhancing the preservation of source face’s identity while maintaining the target image’s important attributes. Third, to enhance further the attribute preservation, we introduce Multi-level Attributes Preservation Loss, which calculates the distance between the intermediate and the final output features of the target and swapped images. We perform quantitative and qualitative evaluations using three datasets, and our model obtains the highest identity accuracy (98.93%) with low pose error (1.62) on FaceForensics++ dataset and less inference time (0.30 sec).","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"7 4","pages":"862-875"},"PeriodicalIF":5.0000,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on biometrics, behavior, and identity science","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11022728/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
This paper introduces an efficient framework for identity- and attributes-aware face swapping. Accurately preserving the source face’s identity while maintaining the target face’s attributes remains a challenge in face swapping due to mismatches between identity and attribute features. To address this, we build on our previous work, ClipSwap, and propose an extended version, ClipSwap++, which improves model efficiency in terms of inference time and memory consumption and preserves identity and attributes more accurately. Our model is mainly composed of a conditional Generative Adversarial Network and a CLIP-based image encoder for generating realistic face-swapped images. We carefully design ClipSwap++ as a combination of the following three components. First, we introduce the Adaptive Identity Fusion Module (AIFM), which ensures accurate identity preservation through the careful integration of ArcFace-encoded identity with CLIP-embedded identity. Second, we propose a new decoder architecture with multiple Multi-level Attributes Integration Modules (MAIM) that adaptively integrate identity and attribute features, enhancing the preservation of the source face’s identity while maintaining the target image’s important attributes. Third, to further enhance attribute preservation, we introduce a Multi-level Attributes Preservation Loss, which measures the distance between the intermediate and final output features of the target and swapped images. We perform quantitative and qualitative evaluations on three datasets; our model obtains the highest identity accuracy (98.93%) with low pose error (1.62) on the FaceForensics++ dataset and a low inference time (0.30 s).
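As a rough illustration of the Multi-level Attributes Preservation Loss described above, the sketch below sums per-level distances between the intermediate and final output features of the target image and of the swapped image. It is a minimal PyTorch sketch based only on the abstract: the function name `multilevel_attribute_loss`, the choice of L1 distance, and the per-level weighting are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a multi-level attribute-preservation loss, assuming PyTorch.
# The L1 distance and uniform level weights are illustrative assumptions; the
# abstract only states that distances between intermediate and final features
# of the target and swapped images are computed.
import torch
import torch.nn.functional as F


def multilevel_attribute_loss(target_feats, swapped_feats, level_weights=None):
    """Sum of per-level L1 distances between features of the target image
    and the corresponding features of the swapped image.

    target_feats / swapped_feats: lists of tensors, one per decoder level
    (intermediate layers plus the final output), each shaped
    (batch, channels, height, width).
    """
    if level_weights is None:
        level_weights = [1.0] * len(target_feats)
    loss = 0.0
    for w, f_tgt, f_swap in zip(level_weights, target_feats, swapped_feats):
        loss = loss + w * F.l1_loss(f_swap, f_tgt)
    return loss


# Hypothetical usage with random tensors standing in for one intermediate
# decoder feature map and the final RGB output.
if __name__ == "__main__":
    tgt = [torch.randn(2, 64, 32, 32), torch.randn(2, 3, 256, 256)]
    swp = [torch.randn(2, 64, 32, 32), torch.randn(2, 3, 256, 256)]
    print(multilevel_attribute_loss(tgt, swp).item())
```

In such a setup, the intermediate features would typically be taken from the decoder at several resolutions so that both coarse attributes (pose, lighting) and fine details (expression, texture) of the target image constrain the swapped result.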