Understanding LDA in source code analysis

2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC) Pub Date : 2014-06-02 DOI:10.1145/2597008.2597150

D. Binkley, Daniel Heinz, Dawn J Lawrie, J. Overfelt


Citation count: 54

Abstract

Latent Dirichlet Allocation (LDA) has seen increasing use in the understanding of source code and its related artifacts, in part because of its impressive modeling power. However, this expressive power comes at a cost: the technique includes several tuning parameters whose impact on the resulting LDA model must be carefully considered. An obvious example is the burn-in period; too short a burn-in period leaves excessive echoes of the initial uniform distribution. The aim of this work is to provide insights into the tuning parameters' impact. Doing so improves the comprehension of both 1) researchers who look to exploit the power of LDA in their research and 2) those who interpret the output of LDA-using tools. It is important to recognize that the goal of this work is not to establish values for the tuning parameters, because there is no universal best setting. Rather, appropriate settings depend on the problem being solved, the input corpus (in this case, typically words from the source code and its supporting artifacts), and the needs of the engineer performing the analysis. This work's primary goal is to aid software engineers in their understanding of the LDA tuning parameters by demonstrating numerically and graphically the relationship between the tuning parameters and the LDA output. A secondary goal is to enable more informed setting of the parameters. Results obtained using both production source code and a synthetic corpus underscore the need for a solid understanding of how to configure LDA's tuning parameters.
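
To make the burn-in remark concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA on a tiny synthetic corpus. It is not the authors' implementation or experimental setup: the function name gibbs_lda, the parameter names (num_topics, alpha, beta, burn_in), and the toy corpus are all assumptions chosen for illustration. Every token starts with a uniformly random topic, so with burn_in=0 the returned counts simply reflect that uniform initialization.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, burn_in, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (hypothetical helper).

    docs is a list of documents, each a list of integer word ids.
    Returns (doc_topic, topic_word) count matrices after burn_in sweeps.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Initialization: each token gets a topic drawn uniformly at random.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Each sweep resamples the topic of every token given all the others.
    for _ in range(burn_in):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Toy corpus (an assumption): documents 0-1 use words 0-2, documents 2-3 use words 3-5.
corpus = [[0, 1, 2, 0, 1, 2], [0, 0, 1, 2, 2, 1],
          [3, 4, 5, 3, 4, 5], [5, 4, 3, 3, 5, 4]]

for sweeps in (0, 200):
    doc_topic, _ = gibbs_lda(corpus, num_topics=2, vocab_size=6,
                             alpha=0.1, beta=0.01, burn_in=sweeps)
    print(f"burn_in={sweeps}: per-document topic counts = {doc_topic}")
```

Comparing the two printed count matrices, and varying alpha, beta, and num_topics, gives a small-scale feel for how the tuning parameters shape the output; the exact numbers depend on the seed and carry no significance beyond this toy example.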
