是否使用自然语言(NLoN)——一个用于软件工程文本分析管道的软件包

2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) Pub Date : 2018-03-20 DOI:10.1145/3196398.3196444

M. Mäntylä, Fabio Calefato, Maëlick Claes

{"title":"是否使用自然语言(NLoN)——一个用于软件工程文本分析管道的软件包","authors":"M. Mäntylä, Fabio Calefato, Maëlick Claes","doi":"10.1145/3196398.3196444","DOIUrl":null,"url":null,"abstract":"The use of natural language processing (NLP) is gaining popularity in software engineering. In order to correctly perform NLP, we must pre-process the textual information to separate natural language from other information, such as log messages, that are often part of the communication in software engineering. We present a simple approach for classifying whether some textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, we are able to achieve an area under the ROC curve performances between 0.976-0.987 on three different data sources, with Lasso regression from Glmnet as our learner and two human raters for providing ground truth. Cross-source prediction performance is lower and has more fluctuation with top ROC performances from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R-package for further improvements.","PeriodicalId":6639,"journal":{"name":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","volume":"18 1","pages":"387-391"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline\",\"authors\":\"M. Mäntylä, Fabio Calefato, Maëlick Claes\",\"doi\":\"10.1145/3196398.3196444\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The use of natural language processing (NLP) is gaining popularity in software engineering. In order to correctly perform NLP, we must pre-process the textual information to separate natural language from other information, such as log messages, that are often part of the communication in software engineering. We present a simple approach for classifying whether some textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, we are able to achieve an area under the ROC curve performances between 0.976-0.987 on three different data sources, with Lasso regression from Glmnet as our learner and two human raters for providing ground truth. Cross-source prediction performance is lower and has more fluctuation with top ROC performances from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R-package for further improvements.\",\"PeriodicalId\":6639,\"journal\":{\"name\":\"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)\",\"volume\":\"18 1\",\"pages\":\"387-391\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3196398.3196444\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3196398.3196444","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 16

摘要

自然语言处理(NLP)的使用在软件工程中越来越受欢迎。为了正确地执行NLP，我们必须对文本信息进行预处理，将自然语言与其他信息(如日志消息)分离开来，这些信息通常是软件工程中通信的一部分。我们提出了一种简单的方法来分类一些文本输入是否是自然语言。虽然我们的NLoN包只依赖于11个语言特征和字符三角图，但我们能够在三个不同的数据源上实现0.976-0.987之间的ROC曲线下的面积，使用Glmnet的Lasso回归作为我们的学习器，使用两个人类评分器提供基础真理。交叉源预测性能较低，波动较大，最高ROC性能为0.913 ~ 0.980。与以前的工作相比，我们的方法提供了类似的性能，但更加轻量级，使其更容易应用于软件工程文本挖掘管道。我们的源代码和数据作为r包提供，以供进一步改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline

The use of natural language processing (NLP) is gaining popularity in software engineering. In order to correctly perform NLP, we must pre-process the textual information to separate natural language from other information, such as log messages, that are often part of the communication in software engineering. We present a simple approach for classifying whether some textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, we are able to achieve an area under the ROC curve performances between 0.976-0.987 on three different data sources, with Lasso regression from Glmnet as our learner and two human raters for providing ground truth. Cross-source prediction performance is lower and has more fluctuation with top ROC performances from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R-package for further improvements.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)

自引率

0.00%

发文量