On Using Active Learning and Self-training when Mining Performance Discussions on Stack Overflow

Markus Borg, Iben Lennerstad, R. Ros, E. Bjarnason
Published in: Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, 2017-04-26
DOI: 10.1145/3084226.3084273 (https://doi.org/10.1145/3084226.3084273)
Citations: 7

Abstract

Abundant data is the key to successful machine learning. However, supervised learning requires annotated data, which are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to the examples that bring the most value to a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is most certain. We report our experiences of using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing the performance of software components. We show that the training examples deemed most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.
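
The two selection steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the query strategy, so the sketch assumes uncertainty sampling on the absolute value of a signed classifier score (for example, the distance to an SVM hyperplane). The function name `split_unlabelled_pool`, the post ids, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_unlabelled_pool(
    scores: Dict[str, float],
    n_query: int,
    n_self_train: int,
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Pick posts for human annotation (AL) and for pseudo-labelling (self-training).

    `scores` maps a hypothetical post id to a signed decision value.
    Assumption: a positive value means "discusses performance", and a
    larger absolute value means the classifier is more certain.
    """
    # Rank by ascending |score|: least certain first, ties broken by id.
    ranked = sorted(scores, key=lambda pid: (abs(scores[pid]), pid))

    # Active learning: the least certain posts go to a human annotator.
    to_annotate = ranked[:n_query]

    # Self-training: among the rest, take the most certain posts and
    # accept the classifier's own prediction as their label.
    remaining = ranked[n_query:]
    most_certain = remaining[::-1][:n_self_train]
    pseudo_labelled = [(pid, 1 if scores[pid] > 0 else 0) for pid in most_certain]

    return to_annotate, pseudo_labelled


# Hypothetical decision values for five unlabelled posts.
example_scores = {
    "post_a": 2.4,
    "post_b": -0.1,
    "post_c": 0.3,
    "post_d": -1.9,
    "post_e": 0.05,
}
to_annotate, pseudo_labelled = split_unlabelled_pool(
    example_scores, n_query=2, n_self_train=2
)
print(to_annotate)      # ['post_e', 'post_b']
print(pseudo_labelled)  # [('post_a', 1), ('post_d', 0)]
```

In this toy pool, the posts closest to the decision boundary are the ones sent to the annotator, which is consistent with the abstract's observation that the examples most valuable to the classifier are also the hardest for humans to label.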