A comparative study on the effectiveness of part-of-speech tagging techniques on bug reports

2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). Pub Date: 2015-03-01. DOI: 10.1109/SANER.2015.7081879

Yuan Tian, D. Lo


Citations: 54

Abstract

Many software artifacts are written in natural language or contain a substantial amount of natural language content. These artifacts can therefore be analyzed with text analysis techniques from the natural language processing (NLP) community, e.g., part-of-speech (POS) tagging, which assigns POS tags (e.g., verb, noun) to the words in a sentence. Several studies have already applied POS tagging to software artifacts to recover important words in them, which are then used to automate various tasks, e.g., locating buggy files for a given bug report. Many POS tagging techniques have been proposed, but they are trained and evaluated on non-software-engineering corpora (documents). It is thus unknown whether they can correctly identify the POS of a word in a software artifact, and which of them performs best. To fill this gap, we investigate the effectiveness of seven POS taggers on bug reports. We randomly sample 100 bug reports from the Eclipse and Mozilla projects and create a text corpus that contains 21,713 words. We manually assign POS tags to these words and use them to evaluate the studied POS taggers. Our comparative study shows that state-of-the-art POS taggers achieve an accuracy of 83.6%-90.5% on bug reports, and that the Stanford POS tagger and the TreeTagger achieve the highest accuracy on the sampled bug reports. Our findings show that researchers could use these POS taggers to analyze software artifacts if an accuracy of 80-90% is acceptable for their specific needs, and we recommend using the Stanford POS tagger or the TreeTagger.
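The evaluation described in the abstract compares each tagger's output against manually assigned tags. A minimal sketch of a token-level accuracy computation of that kind is given below. The function name `tagging_accuracy`, the Penn Treebank-style tags, and the four-token sample sentence are illustrative assumptions; the abstract does not state the tag set or the exact scoring procedure the authors used.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def tagging_accuracy(gold: List[TaggedToken],
                     predicted: List[TaggedToken]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Assumes both sequences cover the same tokens in the same order
    (i.e. the tagger was run on pre-tokenized text).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty corpus")

    correct = 0
    for (gold_word, gold_tag), (pred_word, pred_tag) in zip(gold, predicted):
        if gold_word != pred_word:
            raise ValueError(f"token mismatch: {gold_word!r} vs {pred_word!r}")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


# Hypothetical bug-report fragment with manually assigned (gold) tags.
gold = [("Browser", "NN"), ("crashes", "VBZ"), ("on", "IN"), ("startup", "NN")]
# Hypothetical tagger output: "crashes" is mis-tagged as a plural noun.
predicted = [("Browser", "NN"), ("crashes", "NNS"), ("on", "IN"), ("startup", "NN")]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.1%}")  # prints 75.0%
```

The sample deliberately mis-tags a verb as a noun, a confusion that is plausible in terse bug-report sentences; whether this is a dominant error type in the studied corpus is not stated in the abstract.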

