Using Latent Dirichlet Allocation for automatic categorization of software

Kai Tian, Meghan Revelle, D. Poshyvanyk
{"title":"Using Latent Dirichlet Allocation for automatic categorization of software","authors":"Kai Tian, Meghan Revelle, D. Poshyvanyk","doi":"10.1109/MSR.2009.5069496","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method which is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.","PeriodicalId":413721,"journal":{"name":"2009 6th IEEE International Working Conference on Mining Software Repositories","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"170","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 6th IEEE International Working Conference on Mining Software Repositories","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2009.5069496","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 170

Abstract

In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method which is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.
基于潜狄利克雷分配的自动分类软件
在本文中,我们提出了一种称为LACT的技术,用于对开源存储库中的软件系统进行自动分类。LACT是一种基于潜在狄利克雷分配的信息检索方法,该方法用于将源代码文档作为概率主题的混合物进行索引和分析。为了初步评估,我们进行了两项研究。在第一项研究中,将LACT与现有的工具mudabblue进行比较,将41个用C编写的软件系统划分为问题领域类别。结果表明,LACT可以自动生成有意义的分类名称,分类结果与mudabblue相当。在第二项研究中,我们将LACT应用于43个用C/ c++、Java、c#、PHP和Perl等不同编程语言编写的软件系统。结果表明,无论底层编程语言或范式如何,LACT都可以有效地用于软件系统的自动分类。此外,两项研究都表明,LACT可以识别基于库、体系结构或编程语言的几个新类别,与手动分类和现有技术相比,这是一个有希望的改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信