Unsupervised Extreme Multi Label Classification of Stack Overflow Posts

Peter Devine, Kelly Blincoe
Published in: 2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), 2022-05-01
DOI: 10.1145/3528588.3528652 (https://doi.org/10.1145/3528588.3528652)
Citations: 10

Abstract

Knowing the topics of a software forum post, such as those on StackOverflow, enables deeper analysis and understanding of the large volumes of data these communities produce. One approach to this problem is extreme multi-label classification (XMLC), which predicts the topic (or “tag”) of a post from a potentially very large candidate label set. Previous work has trained such models on data with explicit text-to-tag information. We instead assess the classification ability of embedding models that have not been trained on such structured data (and are thus “unsupervised”), to gauge their potential applicability to other forums or domains where tag data is not available. We evaluate 14 unsupervised pre-trained models on 0.1% of all StackOverflow posts against all 61,662 possible StackOverflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e. without tag data) achieves the highest overall score on this task, with a recall at 1 (R@1) of 0.161. These results indicate which models are most appropriate for XMLC of StackOverflow posts when supervised training is not feasible, and offer insight into these models’ applicability in similar but not identical domains, such as software product forums. They also suggest that training embedding models on in-domain title-body or question-answer pairs can produce an effective zero-shot topic classifier for situations where no topic data is available.
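The zero-shot setup described in the abstract can be illustrated with a minimal sketch: embed the post and every candidate tag name, rank tags by cosine similarity to the post, and score the ranking with recall at k. This is not the authors' implementation. The `embed` function below is a toy character-trigram stand-in (an assumption made so the example runs without downloads) for a pre-trained embedding model, all function names are hypothetical, and `recall_at_k` follows the common definition (the fraction of a post's true tags found in the top k), which is assumed rather than taken from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a pre-trained embedding model: character trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tags(post: str, tags: list[str]) -> list[str]:
    """Order candidate tags from most to least similar to the post."""
    post_vec = embed(post)
    return sorted(tags, key=lambda tag: cosine(post_vec, embed(tag)), reverse=True)


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of the true tags that appear among the top-k ranked tags."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


# Hypothetical post and candidate tag set, for illustration only.
candidate_tags = ["java", "sql", "python"]
ranking = rank_tags("How do I sort a list in Python?", candidate_tags)
print(ranking, recall_at_k(ranking, {"python"}, k=1))
```

Swapping `embed` for a real sentence-embedding model would leave the ranking and scoring logic unchanged. No tag-labelled training data is needed at any point, which is what makes the approach applicable to forums without tags.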