LMAE4Eth: Generalizable and Robust Ethereum Fraud Detection by Exploring Transaction Semantics and Masked Graph Embedding

IF 8; Zone 1, Computer Science; JCR Q1, COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Information Forensics and Security Pub Date : 2025-09-19 DOI:10.1109/TIFS.2025.3612149

Yifan Jia;Yanbin Wang;Jianguo Sun;Ye Tian;Peng Qian

IEEE Transactions on Information Forensics and Security, vol. 20, pp. 10260-10274. Journal Article. IEEE Xplore: https://ieeexplore.ieee.org/document/11173945/

Citations: 0

Abstract

As Ethereum confronts increasingly sophisticated fraud threats, recent research seeks to improve fraud account detection by leveraging advanced pre-trained Transformers or self-supervised graph neural networks. However, current Transformer-based methods rely on context-independent, numerical transaction sequences and fail to capture the semantics of account transactions. Furthermore, the pervasive homogeneity in Ethereum transaction records makes it challenging to learn discriminative account embeddings. Moreover, current self-supervised graph learning methods primarily learn node representations through graph reconstruction, resulting in suboptimal performance on node-level tasks such as fraud account detection; these methods also encounter scalability challenges. To tackle these challenges, we propose LMAE4Eth, a multi-view learning framework that fuses transaction semantics, masked graph embedding, and expert knowledge. We first propose a transaction-token contrastive language model (TxCLM) that transforms context-independent numerical transaction records into logically cohesive linguistic representations and leverages language modeling to learn transaction semantics. To clearly characterize the semantic differences between accounts, we also use a token-aware contrastive learning pre-training objective, which, together with the masked transaction model pre-training objective, learns highly expressive account representations. We then propose a masked account graph autoencoder (MAGAE) using generative self-supervised learning, which achieves superior node-level account detection by focusing on reconstructing account node features rather than graph structure. To enable MAGAE to scale to large-scale training, we integrate layer-neighbor sampling into the graph, which reduces the number of sampled vertices several-fold without compromising training quality. Additionally, we initialize the account nodes in the graph with expert-engineered features to inject empirical and statistical knowledge into the model. Finally, using a cross-attention fusion network, we unify the embeddings of TxCLM and MAGAE to leverage the benefits of both. We evaluate our method against 21 baseline approaches on three datasets. Experimental results show that our method improves the F1-score by more than 10% in the best case compared with the best baseline. Furthermore, we observe across the three datasets that the proposed method demonstrates strong generalization ability compared to previous work. Our source code is available at: https://github.com/lmae4eth/LMAE4Eth
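The idea of reconstructing masked node features rather than graph structure can be illustrated with a minimal sketch. This is not the authors' MAGAE implementation: the function names, the zero-vector mask, the parameter-free mean aggregator standing in for a learned encoder and decoder, and the mean-squared-error loss are all assumptions made for illustration. The point it shows is only that the loss is computed on the masked account nodes, using information from their neighbors.

```python
import random


def mask_node_features(features, mask_rate, rng):
    """Zero out the feature vectors of a random subset of nodes (hypothetical masking scheme)."""
    n = len(features)
    num_masked = max(1, int(round(mask_rate * n)))
    masked_ids = sorted(rng.sample(range(n), num_masked))
    masked_set = set(masked_ids)
    corrupted = [
        [0.0] * len(row) if i in masked_set else list(row)
        for i, row in enumerate(features)
    ]
    return corrupted, masked_ids


def mean_aggregate(features, neighbors):
    """Stand-in for a learned encoder/decoder: average each node with its neighbors."""
    out = []
    for i in range(len(features)):
        group = [features[i]] + [features[j] for j in neighbors[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def masked_reconstruction_loss(original, reconstructed, masked_ids):
    """Mean squared error over masked nodes only (assumed loss, not taken from the paper)."""
    total, count = 0.0, 0
    for i in masked_ids:
        for x, y in zip(original[i], reconstructed[i]):
            total += (x - y) ** 2
            count += 1
    return total / count


# Toy account graph: 4 nodes with 2 hand-made features each (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}

rng = random.Random(0)
corrupted, masked_ids = mask_node_features(features, 0.5, rng)
reconstructed = mean_aggregate(corrupted, neighbors)
loss = masked_reconstruction_loss(features, reconstructed, masked_ids)
print("masked nodes:", masked_ids, "loss:", loss)
```

In a real model the aggregator would be a trainable graph neural network and the loss would be minimized by gradient descent; here the aggregator has no parameters, so the sketch only demonstrates where the training signal comes from.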




Source journal

IEEE Transactions on Information Forensics and Security (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 7.40%

Articles published: 234

Review time: 6.5 months

Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.