Towards Improving the Performance of Comment Generation Models by Using Bytecode Information
Authors: Yuan Huang; Jinbo Huang; Xiangping Chen; Zibin Zheng
Journal: IEEE Transactions on Software Engineering, vol. 51, no. 2, pp. 503-520 (impact factor 6.5; JCR Q1, Computer Science, Software Engineering)
Published: 2025-01-09
DOI: 10.1109/TSE.2024.3523713
URL: https://ieeexplore.ieee.org/document/10836147/
Citations: 0
Abstract
Code comments play an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To generate better comments, many studies extract a variety of information (e.g., code tokens, AST traversal sequences, API call sequences) from source code as model input. In this study, we find that the bytecode compiled from the source code can provide useful information for comment generation, and we therefore propose using bytecode information to assist comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain a CFG sequence that preserves the program structure. We then discuss three methods for introducing bytecode information into different models. To evaluate our method, we collected 390,000 Java methods from the Maven repository and created a dataset of 101,124 samples after deduplication and preprocessing. The results show that introducing the information extracted from the bytecode can improve the BLEU-4 scores of seven comment generation models.
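The abstract does not spell out the serialization method, so the following is only an illustrative sketch of what turning a CFG into a structure-preserving token sequence could look like. It assumes a depth-first, pre-order traversal with explicit block markers, which is a stand-in technique and not the paper's algorithm. The function name `serialize_cfg`, the marker tokens, the block ids, and the example bytecode are all hypothetical.

```python
from typing import Dict, List


def serialize_cfg(
    blocks: Dict[str, List[str]],
    successors: Dict[str, List[str]],
    entry: str,
) -> List[str]:
    """Linearize a CFG by depth-first, pre-order traversal.

    Assumption: this is a generic DFS linearization, not the
    serialization method proposed in the paper.
    """
    sequence: List[str] = []
    visited = set()

    def visit(block_id: str) -> None:
        if block_id in visited:
            # Loop back edge or join point: emit a reference token
            # instead of repeating the block's instructions.
            sequence.append(f"<REF:{block_id}>")
            return
        visited.add(block_id)
        sequence.append(f"<BLOCK:{block_id}>")
        sequence.extend(blocks[block_id])
        for succ in successors.get(block_id, []):
            visit(succ)
        # The closing marker keeps the nesting of branches recoverable.
        sequence.append(f"</BLOCK:{block_id}>")

    visit(entry)
    return sequence


# Hypothetical basic blocks for an instance method like:
#   int abs(int x) { if (x < 0) return -x; return x; }
example_blocks = {
    "B0": ["iload_1", "ifge"],
    "B1": ["iload_1", "ineg", "ireturn"],
    "B2": ["iload_1", "ireturn"],
}
example_successors = {"B0": ["B1", "B2"]}

example_sequence = serialize_cfg(example_blocks, example_successors, "B0")
print(" ".join(example_sequence))
```

In this sketch the open and close markers let a sequence model see which blocks belong to which branch, and the reference token keeps the traversal finite when the graph contains loops. Any real pipeline would first need a bytecode analysis step to build the blocks and edges, which is outside the scope of this example.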
About the journal:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.