Federated micro-expression mining and multi-modal metadata fusion for Deepfake fraud detection in ubiquitous financial video-KYC systems at IoT network
{"title":"Federated micro-expression mining and multi-modal metadata fusion for Deepfake fraud detection in ubiquitous financial video-KYC systems at IoT network","authors":"Romil Rawat , Anjali Rawat , Shweta Gupta , A. Samson Arun Raj , T.M. Thiyagu , Hitesh Rawat , Anand Rajavat","doi":"10.1016/j.fraope.2026.100523","DOIUrl":null,"url":null,"abstract":"<div><div><strong>Introduction & Problem Statement-</strong> The increasing sophistication of AI-generated deepfakes poses significant challenges for financial video-KYC systems, where identity verification relies on accurate and real-time analysis of user biometrics. Traditional centralized and unimodal detection models struggle to balance accuracy, privacy, and deployment scalability, particularly across heterogeneous IoT edge devices. <strong>Need for Research-</strong>There is a pressing need for privacy-preserving, scalable, and robust deepfake detection mechanisms capable of identifying subtle manipulations in real-world financial environments. Current solutions often fail under domain-shift conditions, low-resolution inputs, or in scenarios involving complex micro-expression and behavioral cues. <strong>Proposed Work & Objective-</strong> This research proposes the <strong>Federated Micro-Expression Mining and Multi-Modal Metadata Fusion (FED-MEMF)</strong> framework, designed to accurately detect deepfake fraud in decentralized video-KYC systems. The objectives are to (i) enhance detection accuracy by leveraging facial micro-expression dynamics, audio signals, and session metadata, and (ii) preserve user privacy through federated learning while ensuring low-latency real-time inference. <strong>Novelty-</strong> The novelty lies in integrating fine-grained micro-expression analysis with behavioral metadata fusion in a <strong>federated learning environment</strong>, combined with cross-modal attention mechanisms. 
This approach enables robust detection across multiple datasets while maintaining privacy and edge-device compatibility. <strong>Method-</strong> The framework employs modality-specific encoders—μ-Transformer for micro-expressions, CNN for audio, and LSTM for metadata—with features fused via a cross-modal attention engine. Federated Averaging (FedAvg) aggregates local model updates from IoT edge devices without transferring sensitive data. Quantization and hardware optimizations enable real-time performance on low-power devices. <strong>Dataset-</strong> Experiments utilized <strong>FaceForensics++, CAS(ME)^2</strong>, and a proprietary <strong>KYC-FinVox2024</strong> dataset comprising video, audio, and metadata streams, including micro-expression labels, to evaluate both intra- and cross-dataset performance. <strong>Results-</strong> FED-MEMF achieved an overall accuracy of <strong>98.7%</strong>, F1-score of <strong>0.987</strong>, AUC of <strong>0.996</strong>, and inference latency of <strong>82</strong> <strong>ms</strong>, outperforming XceptionNet, EfficientNet-B4, and CNN+LSTM baselines. Multi-modal fusion significantly reduced false positives and false negatives, demonstrating robustness under domain-shift conditions. <strong>Conclusion & Future Work-</strong> FED-MEMF provides a <strong>privacy-conscious, real-time, and scalable solution</strong> for deepfake detection in financial video-KYC applications. 
Future directions include multilingual audio-visual alignment, blockchain-enabled federated auditing, explainable AI integration, and deployment in other regulatory-sensitive sectors such as e-governance, healthcare, and remote education verification.</div></div>","PeriodicalId":100554,"journal":{"name":"Franklin Open","volume":"14 ","pages":"Article 100523"},"PeriodicalIF":0.0000,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Franklin Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2773186326000393","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/2/4 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction & Problem Statement- The increasing sophistication of AI-generated deepfakes poses significant challenges for financial video-KYC systems, where identity verification relies on accurate, real-time analysis of user biometrics. Traditional centralized and unimodal detection models struggle to balance accuracy, privacy, and deployment scalability, particularly across heterogeneous IoT edge devices. Need for Research- There is a pressing need for privacy-preserving, scalable, and robust deepfake detection mechanisms capable of identifying subtle manipulations in real-world financial environments. Current solutions often fail under domain-shift conditions, on low-resolution inputs, or in scenarios involving complex micro-expression and behavioral cues. Proposed Work & Objective- This research proposes the Federated Micro-Expression Mining and Multi-Modal Metadata Fusion (FED-MEMF) framework, designed to accurately detect deepfake fraud in decentralized video-KYC systems. The objectives are to (i) enhance detection accuracy by leveraging facial micro-expression dynamics, audio signals, and session metadata, and (ii) preserve user privacy through federated learning while ensuring low-latency real-time inference. Novelty- The novelty lies in integrating fine-grained micro-expression analysis with behavioral metadata fusion in a federated learning environment, combined with cross-modal attention mechanisms. This approach enables robust detection across multiple datasets while maintaining privacy and edge-device compatibility. Method- The framework employs modality-specific encoders—μ-Transformer for micro-expressions, CNN for audio, and LSTM for metadata—with features fused via a cross-modal attention engine. Federated Averaging (FedAvg) aggregates local model updates from IoT edge devices without transferring sensitive data. Quantization and hardware optimizations enable real-time performance on low-power devices.
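The cross-modal attention fusion described in the method can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature vectors stand in for the outputs of the μ-Transformer, CNN, and LSTM encoders, and all dimensions, projection weights, and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding size (illustrative)

# Stand-ins for per-modality encoder outputs: micro-expression, audio,
# and session-metadata features of different native sizes.
feats = {
    "micro_expr": rng.normal(size=32),
    "audio": rng.normal(size=24),
    "metadata": rng.normal(size=8),
}

# Linear projections into a shared space (random placeholder weights).
proj = {k: rng.normal(size=(d, v.shape[0])) / np.sqrt(v.shape[0])
        for k, v in feats.items()}
tokens = np.stack([proj[k] @ feats[k] for k in feats])  # shape (3, d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product cross-modal attention: each modality token attends
# over all modality tokens, then the attended tokens are mean-pooled
# into one fused representation for the downstream classifier.
scores = tokens @ tokens.T / np.sqrt(d)   # (3, 3) attention logits
weights = softmax(scores, axis=-1)        # each row sums to 1
fused = weights @ tokens                  # (3, d) attended tokens
pooled = fused.mean(axis=0)               # (d,) fused feature vector
print(pooled.shape)
```

In the paper's framework, learned query/key/value projections and a trained fusion head would replace the random weights and mean pooling used here; the sketch only shows the attention mechanics.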
Dataset- Experiments utilized FaceForensics++, CAS(ME)^2, and a proprietary KYC-FinVox2024 dataset comprising video, audio, and metadata streams, including micro-expression labels, to evaluate both intra- and cross-dataset performance. Results- FED-MEMF achieved an overall accuracy of 98.7%, F1-score of 0.987, AUC of 0.996, and inference latency of 82 ms, outperforming XceptionNet, EfficientNet-B4, and CNN+LSTM baselines. Multi-modal fusion significantly reduced false positives and false negatives, demonstrating robustness under domain-shift conditions. Conclusion & Future Work- FED-MEMF provides a privacy-conscious, real-time, and scalable solution for deepfake detection in financial video-KYC applications. Future directions include multilingual audio-visual alignment, blockchain-enabled federated auditing, explainable AI integration, and deployment in other regulatory-sensitive sectors such as e-governance, healthcare, and remote education verification.
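The Federated Averaging (FedAvg) step named in the method — aggregating local model updates from edge devices without moving raw KYC data — reduces to a sample-count-weighted average of client parameters. A minimal sketch, with illustrative client weights and sample counts (all values here are stand-ins, not the paper's data):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameter vectors, weighted by each
    client's local sample count, to form the new global model."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three edge devices, each holding a locally trained parameter vector
# and a different number of local KYC sessions.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]

global_w = fedavg(clients, sizes)
print(global_w)  # [4. 5.] — pulled toward the clients with more data
```

Only these aggregated parameters (not videos, audio, or metadata) leave the device, which is what gives the framework its privacy-preserving property.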