Do Not Have Enough Data? An Easy Data Augmentation for Code Summarization

2022 IEEE 13th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP). Pub Date: 2022-11-25. DOI: 10.1109/PAAP56126.2022.10010698

Zixuan Song, Xiuwei Shang, Mengxuan Li, Rong Chen, Hui Li, Shikai Guo


Citations: 1

Abstract

Code comments improve the readability and intelligibility of code, helping developers understand programs and making software maintenance and evolution more efficient. Unfortunately, comments in software projects are often mismatched, missing, or outdated, which makes it harder for developers to infer functionality from source code and slows software maintenance and evolution. To solve this problem, many source code summarization algorithms have been proposed. However, these methods usually depend on collecting a large dataset that maps source code to code comments in order to train models. Hence, the effectiveness of the models often relies on the quality of the training data. The training sets face two limitations: insufficient data collection (i.e., generating a large amount of noise-free training data) and data distribution bias (i.e., generating training data for infrequently used methods). To address these issues, we propose a data augmentation method for code comments, named CDA-CS. Extensive experiments on Java and Python projects collected from GitHub are conducted to evaluate the performance of CDA-CS. When trained on the augmented dataset, state-of-the-art algorithms gain a further 1.37% to 2.24% improvement across the evaluation metrics (i.e., BLEU-4, METEOR, ROUGE-L) at no additional cost.
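The abstract does not describe how CDA-CS works internally, so the following is only an illustrative sketch of what comment-side data augmentation for (code, comment) training pairs can look like. It uses two generic text perturbations (random word swap and random word deletion) as stand-ins; these operations, the function names, and the parameters `n_variants`, `p`, and `seed` are assumptions made for illustration and are not taken from the paper.

```python
import random


def random_swap(words, rng):
    """Return a copy of words with two randomly chosen positions swapped."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out


def random_delete(words, rng, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment_pairs(pairs, n_variants=2, seed=0):
    """Hypothetical helper: expand (code, comment) pairs with perturbed comments.

    The code side is left untouched; only the comment text is perturbed.
    This is NOT the CDA-CS algorithm, just a generic stand-in.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    augmented = []
    for code, comment in pairs:
        augmented.append((code, comment))  # always keep the original pair
        words = comment.split()
        for _ in range(n_variants):
            op = rng.choice([random_swap, random_delete])
            augmented.append((code, " ".join(op(words, rng))))
    return augmented


if __name__ == "__main__":
    data = [("def add(a, b): return a + b", "return the sum of two numbers")]
    for code, comment in augment_pairs(data, n_variants=3, seed=1):
        print(repr(comment), "<-", code)
```

Each original pair yields `1 + n_variants` training pairs that share the same code. A real pipeline would also need to filter out variants that no longer describe the code accurately, which this sketch does not attempt.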
