Key class identification: a comprehensive dataset and a new GNN model
Authors: Shizhou Wang, Yuhang Chen, Liangyu Chen
Journal: Applied Intelligence, 55(10) (JCR Q2, Computer Science, Artificial Intelligence; impact factor 3.4)
Published: 2025-04-30
DOI: 10.1007/s10489-025-06574-3
URL: https://link.springer.com/article/10.1007/s10489-025-06574-3
Citations: 0
Abstract
Program comprehension is a critical task in software maintenance. As codebases grow, the human effort it requires increases exponentially. Key Class Identification (KCI) offers an effective solution to this challenge. However, two major obstacles remain: there are no standardized benchmarks, and most existing metric-based approaches are not robust across different software systems. In this paper, we first construct a comprehensive dataset to objectively evaluate KCI performance. Inspired by ensemble learning, we introduce a voting method for key class labeling, which is the primary challenge in dataset construction. Additionally, we propose a novel GNN model that leverages a graph transformer to capture information from directed class dependency networks for key class identification. Extensive experiments on the 170 software systems in our benchmark demonstrate that our approach achieves an accuracy of up to 93.1%, outperforming existing metric-based methods.
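The abstract does not spell out the voting scheme, so the following is only a minimal sketch of one plausible reading: several independent rankers each nominate their top-ranked classes, and a class is labeled key when enough rankers agree. The function name `vote_key_classes`, the parameters `top_k` and `min_votes`, the ranker names, and the class names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def vote_key_classes(rankings, top_k, min_votes):
    """Return classes that at least `min_votes` rankers place in their top `top_k`.

    `rankings` maps a ranker name to a list of class names ordered from
    most to least important. Each list is assumed to contain no duplicates.
    """
    votes = Counter()
    for ranked in rankings.values():
        # Each ranker casts one vote for each of its top_k classes.
        votes.update(ranked[:top_k])
    return sorted(cls for cls, n in votes.items() if n >= min_votes)


# Hypothetical rankings from three metric-based rankers (illustrative only).
rankings = {
    "in_degree": ["Parser", "Lexer", "Token", "Util"],
    "pagerank": ["Parser", "Token", "Lexer", "Util"],
    "betweenness": ["Util", "Parser", "Lexer", "Token"],
}

key_classes = vote_key_classes(rankings, top_k=2, min_votes=2)
print(key_classes)
```

In this toy input only `Parser` appears in the top two of at least two rankers, so it is the only class labeled key. Raising `top_k` or lowering `min_votes` makes the labeling more permissive.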
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.