A method for real-world privacy-preserving Android malware detection through Federated Machine Learning
Giovanni Ciaramella, Fabio Martinelli, Christian Peluso, Antonella Santone, Francesco Mercaldo
Information and Software Technology, volume 189, Article 107892. Published 2025-10-03.
DOI: 10.1016/j.infsof.2025.107892
https://www.sciencedirect.com/science/article/pii/S0950584925002319
Impact factor: 4.3 (JCR Q2, Computer Science, Information Systems)
Abstract
Privacy is one of the most critical issues raised by the spread of Internet of Things and Internet of Everything devices, and several methods have been introduced over the years to address it. In 2017, Google introduced the concept of Federated Machine Learning, a paradigm in which models are trained collaboratively across multiple decentralized devices or servers that hold local data samples without exchanging them. Because raw data remains on the local devices and only model updates are shared and aggregated, the approach enhances data privacy and security. This paper presents a privacy-preserving Android malware detector based on Federated Machine Learning. As a first step, we built a dataset of over 40,000 Android applications, comprising trusted samples and malicious samples belonging to 71 malware families. We then conducted experiments on three different architectures, with weights from the CIFAR-10 and ImageNet datasets, 40 clients, and hyperparameters determined through a Grid Search algorithm. The experimental analysis covers two data distributions: independent and identically distributed (IID) and non-independent and identically distributed (non-IID). To conclude the Federated Machine Learning experiments, we trained models for each architecture, with both weight types and both distributions, applying the Clipping Norm Aggregator. With IID data, the models reach an accuracy of 0.873 without normalization and 0.877 with the Clipping Norm Aggregator. With non-IID data, Custom MobileNet 2 reaches an accuracy of 0.865 without normalization and 0.864 with the Clipping Norm Aggregator. Finally, to compare Federated Machine Learning with a centralized training approach, we trained several models with the same dataset, dataset split, and architectures, achieving an accuracy of 0.944 with InceptionV3. The outcomes show that the proposed method provides promising performance in privacy-preserving Android malware detection.
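As an illustration of the ideas named in the abstract, the following is a minimal NumPy sketch of one federated aggregation round with optional norm clipping, together with simple IID and non-IID client splits. It is not the authors' implementation: the abstract does not specify how the Clipping Norm Aggregator or the non-IID distribution are defined, so the function names (`clip_by_norm`, `aggregate`, `split_iid`, `split_non_iid`), the sample-size weighting, and the label-sorted non-IID split are all assumptions made for this example.

```python
import numpy as np


def clip_by_norm(update, max_norm):
    """Rescale an update so that its L2 norm does not exceed max_norm."""
    norm = float(np.linalg.norm(update))
    if norm == 0.0 or norm <= max_norm:
        return update
    return update * (max_norm / norm)


def aggregate(global_weights, client_weights, client_sizes, max_norm=None):
    """One aggregation round: average client updates weighted by sample count.

    Assumption: clipping is applied to each client's update (client weights
    minus global weights) before averaging. The paper may define it differently.
    """
    updates = [w - global_weights for w in client_weights]
    if max_norm is not None:
        updates = [clip_by_norm(u, max_norm) for u in updates]
    total = float(sum(client_sizes))
    mean_update = sum(u * (n / total) for u, n in zip(updates, client_sizes))
    return global_weights + mean_update


def split_iid(labels, num_clients, rng):
    """IID split: shuffle sample indices, then deal them out evenly."""
    indices = rng.permutation(len(labels))
    return np.array_split(indices, num_clients)


def split_non_iid(labels, num_clients):
    """Hypothetical non-IID split: sort by label so each client sees few classes."""
    indices = np.argsort(labels, kind="stable")
    return np.array_split(indices, num_clients)


if __name__ == "__main__":
    # Flattened model weights are stand-ins for real network parameters.
    global_w = np.zeros(2)
    clients = [np.array([3.0, 4.0]), np.array([0.0, 0.0])]
    sizes = [1, 1]
    print("no clipping:", aggregate(global_w, clients, sizes))
    print("clipped    :", aggregate(global_w, clients, sizes, max_norm=1.0))

    labels = np.array([0, 1] * 10)
    shards = split_non_iid(labels, num_clients=2)
    print("labels per client:", [np.unique(labels[s]).tolist() for s in shards])
```

In this sketch, clipping bounds how far any single client can move the global model in one round, which is one common motivation for norm-based aggregation. Only weight vectors and sample counts reach the aggregator; the raw samples stay with each client.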
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.