Fed-OLF: Federated Oversampling Learning Framework for Imbalanced Software Defect Prediction Under Privacy Protection

IF 5.7 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Reliability | Pub Date: 2025-01-15 | DOI: 10.1109/TR.2024.3524064

Xiaowen Hu; Ming Zheng; Rui Zhu; Xuan Zhang; Zhi Jin

Volume 74, Issue 3, pp. 3266-3280. Full text: https://ieeexplore.ieee.org/document/10842949/

Citations: 0


Fed-OLF: Federated Oversampling Learning Framework for Imbalanced Software Defect Prediction Under Privacy Protection

Software defect prediction builds predictive models that uncover potential errors or hidden defects before a product is put into use, reducing downstream problems and improving software quality and security. Building such models, however, requires sufficient software defect data, especially defective samples. Because this data contains confidential information belonging to different organizations or enterprises, it cannot be shared or used effectively. To train a shared defect prediction model collaboratively across multiple parties while keeping each organization's data local, we apply a federated learning framework to software defect prediction. In addition, nondefective and defective instances in software defect datasets are usually imbalanced, which can seriously degrade prediction performance. This study therefore designs a novel federated oversampling learning framework, Fed-OLF. First, Fed-OLF proposes TabDiT, a method based on a deep generative model, to expand and rebalance each client's local imbalanced software defect dataset with a certain degree of privacy protection. Second, Fed-OLF proposes a parameter aggregation strategy based on local information entropy to further improve the aggregation of the global shared model's parameters and thereby achieve better model performance. Extensive experiments on the PROMISE dataset and the NASA Promise repository show that Fed-OLF achieves better predictive performance in terms of F1-score, G-mean, and AUC than advanced baseline methods. We also verify that both TabDiT and the entropy-based parameter aggregation strategy contribute to Fed-OLF, and that combining them improves model performance more effectively.
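The abstract does not spell out how local information entropy enters the aggregation step, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each client reports the Shannon entropy of its local class labels and that the server averages client parameters with weights proportional to that entropy. The function names (`label_entropy`, `entropy_weights`, `aggregate`) and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def label_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a client's local class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(entropies: Sequence[float]) -> List[float]:
    """Normalize entropies into aggregation weights that sum to 1.

    Assumption: higher entropy (a more balanced client) earns a larger weight.
    Falls back to uniform weights when every entropy is zero.
    """
    total = sum(entropies)
    k = len(entropies)
    if total == 0:
        return [1.0 / k] * k
    return [h / total for h in entropies]


def aggregate(client_params: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Weighted average of flattened client parameter vectors."""
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params))
            for i in range(dim)]


# Three hypothetical clients: labels (1 = defective) and toy parameters.
client_labels = [
    [0, 0, 0, 1],  # imbalanced client, entropy about 0.811 bits
    [0, 0, 1, 1],  # balanced client, entropy 1 bit
    [0, 0, 0, 0],  # single-class client, entropy 0 bits
]
client_params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

entropies = [label_entropy(y) for y in client_labels]
weights = entropy_weights(entropies)
global_params = aggregate(client_params, weights)
print(weights, global_params)
```

Under this assumed weighting, the balanced client contributes most to the global model and the single-class client contributes nothing. A real system might instead combine entropy with local sample counts, as plain federated averaging does.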


Source journal

IEEE Transactions on Reliability | Engineering & Technology - Engineering: Electrical & Electronic

CiteScore

12.20

Self-citation rate

8.50%

Articles published

153

Review time

7.5 months

About the journal: IEEE Transactions on Reliability is a refereed journal for the reliability and allied disciplines including, but not limited to, maintainability, physics of failure, life testing, prognostics, design and manufacture for reliability, reliability for systems of systems, network availability, mission success, warranty, safety, and various measures of effectiveness. Topics eligible for publication range from hardware to software, from materials to systems, from consumer and industrial devices to manufacturing plants, from individual items to networks, from techniques for making things better to ways of predicting and measuring behavior in the field. As an engineering subject that supports new and existing technologies, we constantly expand into new areas of the assurance sciences.