Unit Under Test Identification Using Natural Language Processing Techniques

IF 1.2 Q3 COMPUTER SCIENCE, THEORY & METHODS

Open Computer Science. Pub Date: 2020-12-17. DOI: 10.1515/comp-2020-0150

Matej Madeja, J. Porubän

Open Computer Science, vol. 11, no. 1, pp. 22-32.

Citations: 0

Abstract

Identifying the unit under test (UUT) is often difficult because of test smells, such as testing multiple UUTs in one test. Because tests best reflect the current product specification, they can be used to comprehend parts of the production code and the relationships between them. Because a test and its UUT share a similar vocabulary, this paper applies five Natural Language Processing (NLP) techniques to the source code of five popular GitHub projects. The collected results were compared with manually identified UUTs. The tf-idf model achieved the best accuracy: 22% for ranking the correct UUT first, and 57% when the manually identified UUT was allowed to appear anywhere in the top five results. These results were obtained after preprocessing the input documents with Java keyword removal and word splitting. The tf-idf model also had the shortest training time, and an index search completes within 1 s per request, so it could serve as a support tool in an Integrated Development Environment (IDE) in the future. For document preprocessing, word splitting improves accuracy the most, while removing Java keywords yields only a small improvement in the tf-idf results. Removing comments only slightly worsens the accuracy of the NLP models. Word splitting was also the fastest preprocessing step, averaging 0.3 s to process all documents in a project.
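The pipeline summarized in the abstract (word splitting, Java keyword removal, tf-idf ranking with a top-five tolerance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the shortened keyword list, the smoothed idf formula, the use of cosine similarity, and the sample classes are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: a small illustrative subset, not the full Java keyword list.
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "void", "new", "return",
    "static", "final", "import", "package", "int", "if", "else", "this", "null",
}

def preprocess(source):
    """Split identifiers into lowercase words, then drop Java keywords."""
    tokens = []
    for identifier in re.findall(r"[A-Za-z]+", source):
        # Word split on camel case: "parseOrderTest" -> parse, Order, Test
        words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
        tokens.extend(word.lower() for word in words)
    return [token for token in tokens if token not in JAVA_KEYWORDS]

def weigh(tokens, idf):
    """Turn a token list into a sparse tf-idf vector (term -> weight)."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

def build_index(documents):
    """documents maps a class name to its source text."""
    tokenized = {name: preprocess(text) for name, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n = len(documents)
    # Assumption: smoothed idf; the paper's exact weighting is not given here.
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    vectors = {name: weigh(tokens, idf) for name, tokens in tokenized.items()}
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(test_source, vectors, idf, top_k=5):
    """Return up to top_k production classes most similar to the test."""
    query = weigh(preprocess(test_source), idf)
    ordered = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ordered[:top_k]

def hit_at_k(ranking, expected, k):
    """True if the manually identified UUT appears within the first k results."""
    return expected in ranking[:k]

# Hypothetical production classes and test, invented for this example.
production = {
    "OrderParser": "public class OrderParser { public Order parseOrder(String line) { return new Order(line); } }",
    "InvoicePrinter": "public class InvoicePrinter { public void printInvoice(Invoice invoice) { } }",
    "UserRepository": "public class UserRepository { public User findUser(int id) { return null; } }",
}
test_source = (
    "public class OrderParserTest { void testParseOrder() { "
    "OrderParser parser = new OrderParser(); parser.parseOrder(\"a;b\"); } }"
)

vectors, idf = build_index(production)
ranking = rank_candidates(test_source, vectors, idf)
print(ranking)
print(hit_at_k(ranking, "OrderParser", 1), hit_at_k(ranking, "OrderParser", 5))
```

In this toy setup `OrderParser` is the only candidate that shares vocabulary with the test, so it ranks first. The two `hit_at_k` calls mirror the two accuracy figures reported above: an exact first-place match and a match within the top five.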
