On the Evaluation of Commit Message Generation Models: An Experimental Study

2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2021-07-12. DOI: 10.26226/morressier.613b5419842293c031b5b634

Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Hongyu Zhang, Dongmei Zhang, Wenqiang Zhang


Citations: 25

Abstract

Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance. However, writing commit messages manually is time-consuming and laborious, especially when the code is updated frequently. Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages. To achieve a better understanding of how the existing approaches perform in solving this problem, this paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets. We find that: (1) Different variants of the BLEU metric are used in previous works, which affects the evaluation and understanding of existing methods. (2) Most existing datasets are crawled only from Java repositories while repositories in other programming languages are not sufficiently explored. (3) Dataset splitting strategies can influence the performance of existing models by a large margin. Some models show better performance when the datasets are split by commit, while other models perform better when the datasets are split by timestamp or by project. Based on our findings, we conduct a human evaluation and find the BLEU metric that best correlates with the human scores for the task. We also collect a large-scale, information-rich, and multi-language commit message dataset MCMD and evaluate existing models on this dataset. Furthermore, we conduct extensive experiments under different dataset splitting strategies and suggest the suitable models under different scenarios. Based on the experimental results and findings, we provide feasible suggestions for comprehensively evaluating commit message generation models and discuss possible future research directions. We believe this work can help practitioners and researchers better evaluate and select models for automatic commit message generation. Our source code and data are available at https://github.com/DeepSoftwareAnalytics/CommitMsgEmpirical.
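The three dataset splitting strategies named in the abstract (by commit, by timestamp, by project) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Commit` record, its field names, the 80/20 ratio, and the toy data are all assumptions made for illustration, and the paper's actual splitting procedure may differ in its details.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record; field names are illustrative assumptions."""
    project: str     # repository the commit belongs to
    timestamp: int   # commit time, e.g. Unix seconds
    message: str     # commit message (the generation target)


def split_by_commit(commits, train_ratio=0.8, seed=0):
    """Shuffle all commits, then cut: projects and time periods mix freely."""
    shuffled = list(commits)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def split_by_timestamp(commits, train_ratio=0.8):
    """Train on the oldest commits and test on the newest ones."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def split_by_project(commits, train_ratio=0.8, seed=0):
    """Hold out whole projects so no test project appears in training."""
    projects = sorted({c.project for c in commits})
    random.Random(seed).shuffle(projects)
    cut = int(len(projects) * train_ratio)
    train_projects = set(projects[:cut])
    train = [c for c in commits if c.project in train_projects]
    test = [c for c in commits if c.project not in train_projects]
    return train, test


# Toy data (assumed): 50 commits spread evenly over 5 repositories.
commits = [
    Commit(project=f"repo{i % 5}", timestamp=1000 + i, message=f"change {i}")
    for i in range(50)
]
```

Under these assumptions, a by-commit split lets a model see other commits from the same project and time period during training, whereas the by-timestamp and by-project splits withhold future commits or entire repositories. This difference in what the model has seen is one plausible reason the same model can score differently across strategies.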


