ADTDroid: Leveraging API description and TCP based active learning for Android malware detection

IF 4.3 · Zone 2, Computer Science · JCR Q2, Computer Science, Information Systems

Information and Software Technology · Pub Date: 2025-10-14 · DOI: 10.1016/j.infsof.2025.107930

Zhen Liu, Ruoyu Wang, Wenbin Zhang

Information and Software Technology, Volume 189, Article 107930. https://www.sciencedirect.com/science/article/pii/S0950584925002691

Citations: 0

Abstract

Context:

Extensive research has been conducted on neural network-based Android malware detection models to safeguard the Android software ecosystem. However, the efficacy of detection models may decline over time due to the continuous evolution of malicious behaviors, a phenomenon referred to as the model aging problem.

Objective:

To tackle this problem, existing research primarily focuses on API semantic feature learning and active learning. However, a major challenge in feature learning is the continuous updating of APIs. Additionally, the over-confidence problem in neural networks exacerbates the challenge of selecting uncertain samples during active learning. To handle these challenges, this paper proposes a novel Android malware detection method called ADTDroid. It aims to keep the malware detection model effective in the face of ongoing API updates and malware evolution.

Method:

In this paper, we present a sensitive-event-graph-based feature extraction approach that prioritizes suspicious APIs. To derive API embeddings for feature vector extraction, we propose learning these embeddings directly from the API descriptions provided in the official Android development documentation. This allows embeddings for updated APIs to be obtained immediately from the documentation. Furthermore, we propose a True Class Probability (TCP)-based confidence score to identify uncertain samples for model retraining. These samples exhibit genuine uncertainty, thereby enhancing the model's adaptability to evolving data.
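To make the over-confidence issue concrete, the following minimal sketch contrasts two confidence scores on a toy binary classifier: the maximum class probability (MCP), which is the usual softmax confidence, and the true class probability (TCP), which is the probability assigned to the correct class. All names and numbers are hypothetical and chosen for illustration. The abstract does not describe how ADTDroid obtains TCP for unlabeled samples. Because the true class is unknown at selection time, TCP would have to be estimated in practice (for example, by an auxiliary model), whereas this sketch simply uses known labels to show why the two scores rank samples differently.

```python
from typing import List, Sequence


def max_class_probability(probs: Sequence[float]) -> float:
    """Confidence as the largest softmax output (the usual baseline)."""
    return max(probs)


def true_class_probability(probs: Sequence[float], true_label: int) -> float:
    """Confidence as the probability assigned to the correct class."""
    return probs[true_label]


def select_uncertain(scores: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` samples with the lowest confidence."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:budget]


# Hypothetical softmax outputs [P(benign), P(malware)] with ground-truth labels
# (0 = benign, 1 = malware). These values are assumptions, not paper data.
probs = [
    [0.97, 0.03],  # benign, correctly classified
    [0.92, 0.08],  # malware, confidently misclassified as benign
    [0.55, 0.45],  # benign, correct but close to the decision boundary
    [0.10, 0.90],  # malware, correctly classified
]
labels = [0, 1, 0, 1]

mcp = [max_class_probability(p) for p in probs]
tcp = [true_class_probability(p, y) for p, y in zip(probs, labels)]

picked_by_mcp = select_uncertain(mcp, budget=2)
picked_by_tcp = select_uncertain(tcp, budget=2)

print("selected by MCP:", picked_by_mcp)
print("selected by TCP:", picked_by_tcp)
```

In this toy setting, MCP passes over the confidently misclassified malware sample (index 1) because its top softmax output is high, while TCP ranks it as the least confident sample. That gap is the behavior a TCP-style score is meant to close when choosing samples for retraining.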

Results:

Through extensive experimentation on large-scale real-world datasets covering the period from 2013 to 2022, our method achieves significant improvements in the F-score of malware detection. Compared to existing active learning-based approaches, our method achieves relative improvements of approximately 10% over APIGraph and 8.1% over contrastive autoencoder techniques.
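The reported gains are relative improvements, which differ from absolute differences in F-score. The short sketch below shows the distinction, assuming the common definition (new − baseline) / baseline. The abstract does not list the underlying F-scores, so the values used here are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of new_score over baseline_score, as a fraction."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score


# Hypothetical F-scores, NOT taken from the paper.
baseline_f = 0.80
new_f = 0.88

rel = relative_improvement(new_f, baseline_f)  # relative gain as a fraction
abs_gain = new_f - baseline_f                  # absolute gain in F-score

print(f"relative improvement: {rel:.1%}")
print(f"absolute improvement: {abs_gain:.2f}")
```

With these assumed numbers, an absolute gain of 0.08 in F-score corresponds to a relative improvement of about 10%, so the two figures should not be read interchangeably.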

Conclusion:

ADTDroid can enhance the performance of feature extraction in cases of model aging. It can also improve the selection of uncertain samples to adapt the malware detection model to new data.


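Source journal

Information and Software Technology (Engineering & Technology; Computer Science: Software Engineering)

CiteScore: 9.10
Self-citation rate: 7.70%
Publication volume: 164
Review time: 9.6 weeks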

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.