Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas
{"title":"开放词汇码补全的优化标记化过程:实证研究","authors":"Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas","doi":"10.1145/3593434.3594236","DOIUrl":null,"url":null,"abstract":"Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as tokenizers provided by large pre-trained models such as PolyCoder and CodeGen. The results indicate that the choice of tokenization method can significantly impact the number of sub-tokens generated, which can ultimately influence the modeling performance of a model. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms other baselines, achieving improved tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.","PeriodicalId":178596,"journal":{"name":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study\",\"authors\":\"Yasir Hussain, Zhiqiu Huang, Yu Zhou, I. A. Khan, Nasrullah Khan, Muhammad Zahid Abbas\",\"doi\":\"10.1145/3593434.3594236\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are divided into smaller units, known as tokens, utilizing either an open or closed vocabulary system. The selection of a tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance the tokenization performance. The proposed tokenizer employs a hybrid approach that initializes with a global vocabulary based on the most frequent unigrams and incrementally builds an open-vocabulary system. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as tokenizers provided by large pre-trained models such as PolyCoder and CodeGen. 
The results indicate that the choice of tokenization method can significantly impact the number of sub-tokens generated, which can ultimately influence the modeling performance of a model. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms other baselines, achieving improved tokenization performance both in terms of a reduced number of sub-tokens and time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.\",\"PeriodicalId\":178596,\"journal\":{\"name\":\"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3593434.3594236\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3593434.3594236","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Optimized Tokenization Process for Open-Vocabulary Code Completion: An Empirical Study
Studies have substantiated the efficacy of deep learning-based models in various source code modeling tasks. These models are usually trained on large datasets that are split into smaller units, known as tokens, using either an open or a closed vocabulary system. The choice of tokenization method can have a profound impact on the number of tokens generated, which in turn can significantly influence the performance of the model. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to enhance tokenization performance. The proposed tokenizer employs a hybrid approach that initializes a global vocabulary from the most frequent unigrams and incrementally builds an open-vocabulary system on top of it. The proposed tokenizer is evaluated against popular tokenization methods such as Closed, Unigram, WordPiece, and BPE tokenizers, as well as the tokenizers provided by large pre-trained models such as PolyCoder and CodeGen. The results indicate that the choice of tokenization method significantly affects the number of sub-tokens generated, which in turn influences the performance of the resulting model. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms the other baselines, achieving better tokenization performance in terms of both a reduced number of sub-tokens and a lower time cost. In conclusion, this study highlights the significance of the choice of tokenization method in source code modeling and the potential for improvement through optimized tokenization techniques.
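The abstract describes the proposed tokenizer only at a high level: a global vocabulary seeded with the most frequent unigrams, extended incrementally into an open-vocabulary system. The sketch below is a minimal, illustrative Python reading of that idea, not the authors' implementation; the class name, the greedy character-level fallback split, and the vocabulary size are assumptions made for the example.

```python
from collections import Counter


class HybridTokenizer:
    """Minimal sketch (hypothetical): a global vocabulary of frequent
    unigrams plus an incremental open-vocabulary fallback for unseen tokens."""

    def __init__(self, vocab_size=10_000):
        self.vocab_size = vocab_size
        self.vocab = set()

    def fit(self, corpus):
        # Seed the global vocabulary with the most frequent unigrams.
        counts = Counter(tok for line in corpus for tok in line.split())
        self.vocab = {tok for tok, _ in counts.most_common(self.vocab_size)}

    def tokenize(self, line):
        out = []
        for tok in line.split():
            if tok in self.vocab:
                out.append(tok)
            else:
                # Open-vocabulary fallback: split the unseen token into
                # sub-tokens and grow the vocabulary incrementally.
                pieces = self._split(tok)
                self.vocab.update(pieces)
                out.extend(pieces)
        return out

    def _split(self, tok):
        # Greedy longest match against the current vocabulary; anything
        # unmatched degrades to single-character sub-tokens.
        pieces, i = [], 0
        while i < len(tok):
            for j in range(len(tok), i, -1):
                if tok[i:j] in self.vocab or j == i + 1:
                    pieces.append(tok[i:j])
                    i = j
                    break
        return pieces


if __name__ == "__main__":
    tokenizer = HybridTokenizer(vocab_size=6)
    tokenizer.fit(["def add ( a , b ) :", "return a + b"])
    # "subtract" is out of vocabulary, so it is split into sub-tokens.
    print(tokenizer.tokenize("def subtract ( a , b ) :"))
```

The sub-token counts produced by such a tokenizer, compared against Closed, Unigram, WordPiece, and BPE baselines, are the kind of quantity the study's empirical evaluation measures.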