Wolf in Sheep’s Clothing: Shearing the Camouflage of Malicious Java Components in Maven

IF 5.6 | Zone 1 Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering Pub Date : 2025-08-19 DOI:10.1109/TSE.2025.3599732

Yutong Zeng;Cheng Huang;Jiaxuan Han;Jianguo Zhao;Nannan Wang;Genpei Liang;Shuyi Jiang

Volume 51, Issue 10, Pages 2847-2863. https://ieeexplore.ieee.org/document/11129930/

Citations: 0

Abstract

In recent years, software supply chain attacks have become increasingly prevalent, prompting considerable research into detecting malicious packages in package repositories. Java, its popularity bolstered by the widespread adoption of open-source practices, has become one of the preferred languages among modern developers. However, malware detection in Java components remains an open problem. Most prior approaches suffer from insufficient code coverage and coarse-grained representation, making them unsuitable for Java components. In this paper, we propose Shear, a solution tailored to detecting malicious Java components. First, Shear analyzes all methods in the component and locates potentially malicious code snippets based on sensitive calls, since slice-level analysis provides a better understanding of the specific malicious activities. Second, it extracts the statements that depend on sensitive call sites and embeds them into vectors for detection, rather than using a function-level representation, which is too coarse-grained for Java's dynamic features. Experimental results show that Shear identifies the malicious semantics hidden in the code slices by leveraging a neural network model and substantially outperforms currently available tools. In real-world validation, Shear detected 51 components with malicious characteristics out of 68,273, demonstrating its practical feasibility. This study introduces the first malicious Java component detection method suitable for real-world scenarios, which is of considerable practical significance for strengthening defenses in the software supply chain.
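The slice-extraction step described in the abstract can be illustrated with a toy sketch. This is not Shear's implementation: the `Stmt` record, the `SENSITIVE_CALLS` set, and the helper names are hypothetical, and the slice is a simple backward data-dependence slice over straight-line code, which is only one possible reading of "statements depending on sensitive call sites".

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sensitive APIs; the abstract does not list the ones Shear uses.
SENSITIVE_CALLS = {"Runtime.exec", "URL.openStream"}


@dataclass(frozen=True)
class Stmt:
    text: str                    # source text of the statement
    defines: frozenset           # variables the statement writes
    uses: frozenset              # variables the statement reads
    call: Optional[str] = None   # API invoked by the statement, if any


def backward_slice(method: List[Stmt], site: int) -> List[int]:
    """Indices of statements the call at `site` depends on through data flow.

    Assumes straight-line code (no branches or loops).
    """
    needed = set(method[site].uses)
    keep = [site]
    for i in range(site - 1, -1, -1):
        stmt = method[i]
        if stmt.defines & needed:
            keep.append(i)
            needed -= stmt.defines   # this definition satisfies the need
            needed |= stmt.uses      # its own inputs are now needed
    return sorted(keep)


def sensitive_slices(method: List[Stmt]) -> List[List[int]]:
    """One slice per sensitive call site found in the method."""
    return [
        backward_slice(method, i)
        for i, stmt in enumerate(method)
        if stmt.call in SENSITIVE_CALLS
    ]


# A made-up method body, written as pre-analyzed statements.
method = [
    Stmt('String host = "example.invalid";', frozenset({"host"}), frozenset()),
    Stmt("int n = items.size();", frozenset({"n"}), frozenset({"items"})),
    Stmt('String cmd = "curl " + host;', frozenset({"cmd"}), frozenset({"host"})),
    Stmt("log.info(n);", frozenset(), frozenset({"log", "n"})),
    Stmt("Runtime.getRuntime().exec(cmd);", frozenset(), frozenset({"cmd"}),
         call="Runtime.exec"),
]

slices = sensitive_slices(method)
```

Here `slices` holds a single slice with statements 0, 2, and 4: the sensitive call and the two assignments that feed its argument. The unrelated statements 1 and 3 are dropped, which is the kind of noise a function-level representation would keep. In the pipeline the abstract describes, such a slice would then be embedded into a vector and passed to a neural network classifier; that step is omitted here.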

Source journal

IEEE Transactions on Software Engineering (Engineering Technology - Engineering: Electrical & Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of the Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.