A language model for statements of software code

2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)
Pub Date: 2017-10-30
DOI: 10.1109/ASE.2017.8115678

Yixiao Yang, Yu Jiang, M. Gu, Jiaguang Sun, Jian Gao, Han Liu


Citations: 13

Abstract

Building language models for source code enables a large set of improvements to traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at the token level using lexical information. Such language models are well suited to predicting short token sequences, but become less effective for long, statement-level predictions. In this paper, we propose PCC to optimize token-level language modeling. Specifically, PCC introduces an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, and to suggest a complete statement with a precise synthesizer. Furthermore, PCC employs a fuzzy matching technique that combines genetic and longest common subsequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results demonstrate the potential of PCC in generating precise long, statement-level predictions. In 30%–60% of the cases, it correctly suggests the complete statement with only six candidates, and in 40%–90% of the cases with ten candidates.
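The abstract names the longest common subsequence (LCS) as one ingredient of PCC's fuzzy matching but does not say how it is combined with the genetic algorithm. The sketch below is therefore not PCC's method; it only illustrates, under our own assumptions, how an LCS score can measure similarity between two token sequences and rank candidate statements. The function names (`lcs_length`, `lcs_similarity`, `rank_candidates`) and the normalization by the longer sequence are hypothetical choices made for this example.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def rank_candidates(
    query: Sequence[str], candidates: List[Sequence[str]], k: int
) -> List[Sequence[str]]:
    """Return the k candidates most similar to `query` (hypothetical helper)."""
    ranked = sorted(candidates, key=lambda c: lcs_similarity(query, c), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    query = ["int", "i", "=", "0", ";"]
    candidates = [
        ["return", "i", ";"],
        ["int", "j", "=", "0", ";"],
    ]
    print(rank_candidates(query, candidates, k=1))
```

Because the score tolerates tokens that differ in the middle of a sequence, two statements that vary only in a variable name still rank as close matches, which is the kind of tolerance a fuzzy matcher needs.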

