挖掘源代码自动分割标识符用于软件分析

Eric Enslen, Emily Hill, L. Pollock, K. Vijay-Shanker
{"title":"挖掘源代码自动分割标识符用于软件分析","authors":"Eric Enslen, Emily Hill, L. Pollock, K. Vijay-Shanker","doi":"10.1109/MSR.2009.5069482","DOIUrl":null,"url":null,"abstract":"Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.","PeriodicalId":413721,"journal":{"name":"2009 6th IEEE International Working Conference on Mining Software Repositories","volume":"312 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"193","resultStr":"{\"title\":\"Mining source code to automatically split identifiers for software analysis\",\"authors\":\"Eric Enslen, Emily Hill, L. Pollock, K. Vijay-Shanker\",\"doi\":\"10.1109/MSR.2009.5069482\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.\",\"PeriodicalId\":413721,\"journal\":{\"name\":\"2009 6th IEEE International Working Conference on Mining Software Repositories\",\"volume\":\"312 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"193\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 6th IEEE International Working Conference on Mining Software Repositories\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MSR.2009.5069482\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 6th IEEE International Working Conference on Mining Software Repositories","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSR.2009.5069482","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 193

摘要

自动化的软件工程工具(例如,程序搜索、关注点定位、代码重用、质量评估等)越来越依赖于来自代码注释和标识符的自然语言信息。从标识符中分析单词的第一步需要将标识符拆分为组成它们的单词。与使用空格和标点符号来描绘单词的自然语言不同,标识符不能包含空格。分割标识符的一种常用方法是遵循编程语言的命名约定。例如,Java程序员经常使用驼峰大小写,其中单词由大写字母或非字母字符描述。然而,程序员也通过将单词序列连接在一起来创建标识符,而没有可识别的描述,这对自动标识符分割提出了挑战。本文提出了一种通过挖掘源代码中的词频将标识符自动分割成词序列的算法。有了这些单词频率,我们的标识符分配器使用评分技术自动为标识符选择最合适的分区。在对来自开放源码Java程序的8000多个标识符的评估中,我们的Samurai方法优于现有的艺术技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Mining source code to automatically split identifiers for software analysis
Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信