GNPSum: A code summarization enhancement framework based on Graph Node Position

IF 4.3 · Zone 2, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology Pub Date : 2025-07-19 DOI:10.1016/j.infsof.2025.107837

Haogang Cheng, Ling Xu, Luwen Huangfu, Chao Liu, Meng Yan, Yan Lei

Volume 187, Article 107837 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0950584925001764

Citations: 0

Abstract

Code summarization communicates a program's core functionality and logic, which improves software development efficiency, collaboration, and code quality. Earlier work generated summaries from textual information extracted from the source code, but such approaches often fail to capture the hierarchical structure that accurate summarization depends on. To capture that structure, researchers often integrate structural representations such as Abstract Syntax Trees (ASTs) into their models. However, conventional embedding methods struggle to discern semantic nuances within the code, particularly for nodes with similar content but distinct structural roles. The relative positional information of these nodes, which often conveys semantics absent from the source code itself, is frequently overlooked, limiting the model's ability to exploit the hierarchical and contextual richness of the code structure.
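As a small illustration of "similar content but distinct structural roles", the sketch below uses Python's standard `ast` module to list every occurrence of one identifier together with its root-to-node path. The helper `name_node_paths` and the two-line sample program are hypothetical and are not taken from the paper; they only show that identical node text can sit at different positions in the tree.

```python
import ast


def name_node_paths(source: str, identifier: str) -> list:
    """Return the root-to-node type path of every Name node equal to `identifier`."""
    tree = ast.parse(source)
    paths = []

    def visit(node, trail):
        trail = trail + [type(node).__name__]
        if isinstance(node, ast.Name) and node.id == identifier:
            # ctx is Store for an assignment target and Load for a read.
            paths.append("/".join(trail) + ":" + type(node.ctx).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child, trail)

    visit(tree, [])
    return paths


# Illustrative input: `x` is defined on the first line and read on the second.
paths = name_node_paths("x = 1\ny = x + 2\n", "x")
for p in paths:
    print(p)
```

Both nodes carry the same text `x`, yet one is an assignment target directly under an `Assign` node and the other is an operand nested inside a `BinOp`. An embedding that looks only at node content cannot tell them apart.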

To overcome these limitations, we propose GNPSum, a code summarization enhancement framework based on graph node position. GNPSum employs a structural combinatorial graph (SCG), which extends the AST edges with control flow graph (CFG) and data flow graph (DFG) edges to aggregate multimodal information. We introduce a novel positional embedding technique that leverages distances between nodes to reduce semantic ambiguity and guide summary generation. Evaluations on extensive Java and Python datasets show that GNPSum improves the BLEU score by 3.30% and 1.82%, respectively, over the best-performing baseline. Furthermore, our validation shows that GNPSum significantly enhances the structural comprehension of pre-trained models, yielding a 1.93% performance gain over models fine-tuned without our framework.
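The abstract does not specify how the SCG is built or how node distances enter the embedding, so the following is only a minimal sketch under stated assumptions: edges are treated as undirected, distance means shortest-path hop count, and the toy node names (`x_def`, `x_use`, and so on) are hypothetical. It shows how adding control-flow and data-flow edges to AST edges changes the distances between nodes.

```python
from collections import deque


def build_graph(edge_sets):
    """Merge several edge lists into one undirected adjacency map (assumption)."""
    adj = {}
    for edges in edge_sets:
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def hop_distances(adj, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical toy graph for the program: x = 1; y = x + 2
ast_edges = [("Module", "Assign1"), ("Module", "Assign2"),
             ("Assign1", "x_def"), ("Assign1", "Const1"),
             ("Assign2", "y_def"), ("Assign2", "BinOp"),
             ("BinOp", "x_use"), ("BinOp", "Const2")]
cfg_edges = [("Assign1", "Assign2")]  # statement execution order
dfg_edges = [("x_def", "x_use")]      # value flows from definition to use

ast_only = hop_distances(build_graph([ast_edges]), "x_def")
combined = hop_distances(build_graph([ast_edges, cfg_edges, dfg_edges]), "x_def")

print(ast_only["x_use"], combined["x_use"])
```

In this toy case the definition and the use of `x` are five hops apart in the AST alone but one hop apart once the data-flow edge is added. Distances of this kind are the sort of signal a distance-based positional embedding could consume; how GNPSum actually encodes them is described in the full paper, not here.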


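Source journal

Information and Software Technology (Engineering and Technology; Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks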

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.