GNPSum: A code summarization enhancement framework based on Graph Node Position
Haogang Cheng, Ling Xu, Luwen Huangfu, Chao Liu, Meng Yan, Yan Lei
Information and Software Technology, Volume 187, Article 107837. Published 2025-07-19.
DOI: 10.1016/j.infsof.2025.107837
https://www.sciencedirect.com/science/article/pii/S0950584925001764
Impact factor: 4.3; JCR: Q2 (Computer Science, Information Systems)
Citations: 0
Abstract
Code summarization is essential for communicating a program's core functionality and logic, and it improves software development efficiency, collaboration, and code quality. Traditional work has focused on generating summaries from textual information extracted from the source code, but these approaches often fail to capture the hierarchical structure that is critical for effective summarization. To capture this structure, researchers often integrate structural representations such as Abstract Syntax Trees (ASTs) into their models. However, conventional embedding methods struggle to discern semantic nuances within the code, particularly for nodes with similar content but distinct structural roles. The relative positional information of these nodes, which often conveys semantics absent from the source code itself, is frequently overlooked, limiting the model's ability to fully exploit the hierarchical and contextual richness inherent in the code structure.
To overcome these limitations, we propose GNPSum, a code summarization enhancement framework based on graph node position. GNPSum employs a structural combinatorial graph (SCG) approach, which extends the AST with control flow graph (CFG) and data flow graph (DFG) edges to aggregate multimodal information. We introduce a novel positional embedding technique that leverages distances between nodes to reduce semantic ambiguity and guide effective summary generation. Evaluations on extensive Java and Python datasets demonstrate that GNPSum improves the BLEU score by 3.30% and 1.82%, respectively, over the best-performing baseline. Furthermore, our validation shows that GNPSum significantly enhances the structural comprehension of pre-trained models, resulting in a 1.93% performance boost over models fine-tuned without our framework.
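To make the idea of a combined graph with distance-based positions more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Python's standard `ast` module as the parser, uses "consecutive statements in a body" as a crude stand-in for control-flow edges, and links each variable read to the latest earlier definition of the same name as a crude stand-in for data-flow edges. The function names `build_combined_graph` and `hop_distances` are hypothetical. Hop counts such as these are one kind of relative distance that a positional embedding could be indexed by; how GNPSum actually encodes distances is not specified in the abstract.

```python
import ast
from collections import defaultdict, deque

# Singleton helper nodes (Load/Store, operators) may be shared across the
# tree, so they are dropped to avoid creating artificial shortcuts.
_SKIP = (ast.expr_context, ast.operator, ast.unaryop, ast.cmpop, ast.boolop)


def build_combined_graph(source):
    """Return (nodes, adj): AST edges plus toy control- and data-flow edges."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if not isinstance(n, _SKIP)]
    index = {id(n): i for i, n in enumerate(nodes)}
    adj = defaultdict(set)

    def link(a, b):
        i, j = index[id(a)], index[id(b)]
        adj[i].add(j)
        adj[j].add(i)

    for node in nodes:
        # AST edges: parent <-> child.
        for child in ast.iter_child_nodes(node):
            if id(child) in index:
                link(node, child)
        # Toy control-flow edges: consecutive statements in the same body.
        body = getattr(node, "body", None)
        if isinstance(body, list):
            for first, second in zip(body, body[1:]):
                link(first, second)

    # Toy data-flow edges: each variable read links to the latest earlier
    # definition with the same name (source order, no branch analysis).
    names = [n for n in nodes if isinstance(n, (ast.Name, ast.arg))]
    names.sort(key=lambda n: (n.lineno, n.col_offset))
    last_def = {}
    for n in names:
        ident = n.arg if isinstance(n, ast.arg) else n.id
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load):
            if ident in last_def:
                link(last_def[ident], n)
        else:
            last_def[ident] = n
    return nodes, adj


def hop_distances(adj, start):
    """Breadth-first shortest-path distance (in edges) from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


SOURCE = "def f(a):\n    total = a\n    return total\n"
nodes, adj = build_combined_graph(SOURCE)
# Two nodes with identical content ("total") but different structural roles.
store = next(i for i, n in enumerate(nodes)
             if isinstance(n, ast.Name) and n.id == "total"
             and isinstance(n.ctx, ast.Store))
load = next(i for i, n in enumerate(nodes)
            if isinstance(n, ast.Name) and n.id == "total"
            and isinstance(n.ctx, ast.Load))
print(hop_distances(adj, store)[load])
```

In this toy input the two `total` nodes carry the same token, yet one is an assignment target and the other a returned value. In the AST alone they are four edges apart; the added data-flow edge connects them directly, which illustrates how extra edge types change the distances a position-aware model would see.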
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.