On Using Active Learning and Self-training when Mining Performance Discussions on Stack Overflow

Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering | Pub Date: 2017-04-26 | DOI: 10.1145/3084226.3084273

Markus Borg, Iben Lennerstad, R. Ros, E. Bjarnason


Citations: 7

Abstract

Abundant data is the key to successful machine learning. However, supervised learning requires annotated data that are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to examples that bring the most value for a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is the most certain. We report our experiences of using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing the performance of software components. We show that the training examples deemed the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.
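The combination of AL and self-training described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes margin-based uncertainty, where each unlabelled post has a signed score such as an SVM's distance to the separating hyperplane, and the function name `split_by_certainty` and its parameters are hypothetical. Posts with the smallest absolute score go to a human annotator (AL), and posts with the largest absolute score are pseudo-labelled by the classifier itself (self-training).

```python
from typing import List, Tuple


def split_by_certainty(
    scores: List[float], n_query: int, n_self: int
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Pick examples for human annotation and for self-training.

    scores: signed classifier score per unlabelled example (assumed to be
            something like an SVM margin; positive means "performance post").
    n_query: number of least-certain examples to send to an annotator.
    n_self: number of most-certain examples to pseudo-label.
    Returns (query_indices, [(index, pseudo_label), ...]).
    """
    # Rank indices from least certain (small |score|) to most certain.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]))

    # Active learning step: the least certain examples go to a human.
    query = ranked[:n_query]

    # Self-training step: the most certain of the remaining examples
    # receive the classifier's own prediction as their label.
    remaining = ranked[n_query:]
    confident = remaining[::-1][:n_self]
    pseudo = [(i, 1 if scores[i] > 0 else 0) for i in confident]
    return query, pseudo


# Toy scores for five unlabelled posts (illustrative values only).
query, pseudo = split_by_certainty([2.5, -0.1, 0.3, -3.0, 0.05], 2, 2)
print(query)   # indices to annotate manually
print(pseudo)  # (index, pseudo-label) pairs to add to the training set
```

In a full loop, the annotated and pseudo-labelled examples would be added to the training set, the classifier retrained, and the scores recomputed before the next round.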
