Unit Under Test Identification Using Natural Language Processing Techniques

Impact factor: 1.1 | JCR Q3 | Computer Science, Theory & Methods
Matej Madeja, J. Porubän
Open Computer Science, vol. 11, no. 1, pp. 22-32. Published 2020-12-17. DOI: 10.1515/comp-2020-0150 (https://doi.org/10.1515/comp-2020-0150)
Citations: 0

Abstract

Identifying the unit under test (UUT) is often difficult because of test smells, such as testing multiple UUTs in a single test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Because a test and its UUT share a similar vocabulary, this paper applies five natural language processing (NLP) techniques to the source code of five popular GitHub projects. The results were compared with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for ranking the correct UUT first, and 57% when the manually identified UUT may appear anywhere in the top five results. These results were obtained after preprocessing the input documents by removing Java keywords and splitting words. The tf-idf model also had the shortest training time, and an index search completes within 1 s per request, so it could serve as a support tool in an Integrated Development Environment (IDE) in the future. Among the preprocessing steps, word splitting improves accuracy the most, while removing Java keywords yields only a small improvement for the tf-idf model. Removing comments only slightly worsens the accuracy of the NLP models. Word splitting was also the fastest preprocessing step, taking 0.3 s on average to process all documents in a project.
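The pipeline summarized in the abstract (word splitting, Java keyword removal, tf-idf ranking, and a top-k tolerance check) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the shortened keyword list, the smoothed idf formula, the use of cosine similarity, and the toy classes are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Illustrative subset only; the full Java keyword list is longer.
JAVA_KEYWORDS = {"public", "private", "class", "void", "new",
                 "return", "int", "static", "import", "final"}


def preprocess(source):
    """Split identifiers into lowercase words and drop Java keywords."""
    tokens = []
    for identifier in re.findall(r"[A-Za-z]+", source):
        # Word split: "testAddItem" -> test, Add, Item; "HTTPServer" -> HTTP, Server
        for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier):
            word = word.lower()
            if word not in JAVA_KEYWORDS:
                tokens.append(word)
    return tokens


def tfidf_vectors(docs):
    """Map each document name to a {term: tf-idf weight} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(test_source, production_sources):
    """Rank production classes by similarity to the test, best match first."""
    docs = {name: preprocess(src) for name, src in production_sources.items()}
    docs["__test__"] = preprocess(test_source)  # the test acts as the query
    vectors = tfidf_vectors(docs)
    query = vectors.pop("__test__")
    return sorted(vectors, key=lambda name: cosine(query, vectors[name]),
                  reverse=True)


def hit_within(ranking, expected, k):
    """True if the manually identified UUT appears in the top k results."""
    return expected in ranking[:k]


# Hypothetical toy project: three production classes and one test class.
production = {
    "ShoppingCart": "public class ShoppingCart { void addItem(Item item) {} "
                    "int totalPrice() { return 0; } }",
    "UserAccount": "public class UserAccount { "
                   "void changePassword(String password) {} }",
    "InvoicePrinter": "public class InvoicePrinter { "
                      "void printInvoice(Invoice invoice) {} }",
}
test = ("public class ShoppingCartTest { void testAddItemIncreasesTotalPrice() { "
        "ShoppingCart cart = new ShoppingCart(); cart.addItem(new Item()); } }")

ranking = rank_candidates(test, production)
print(ranking)
print(hit_within(ranking, "ShoppingCart", 1), hit_within(ranking, "ShoppingCart", 5))
```

In this sketch, `hit_within(ranking, expected, 1)` corresponds to the strict "correct UUT first" measure and `hit_within(ranking, expected, 5)` to the tolerance up to fifth place; averaging either over many tests would give the kind of accuracy figure reported above.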
Source journal: Open Computer Science (Computer Science, Theory & Methods)
CiteScore: 4.00
Self-citation rate: 0.00%
Articles published: 24
Review time: 25 weeks