Unsupervised Extreme Multi Label Classification of Stack Overflow Posts
Peter Devine, Kelly Blincoe
2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), May 2022. DOI: 10.1145/3528588.3528652
Knowing the topics of a software forum post, such as those on Stack Overflow, allows for greater analysis and understanding of the large amounts of data that come from these communities. One approach to this problem is extreme multi-label classification (XMLC), which predicts the topic (or "tag") of a post from a potentially very large candidate label set. While previous work has trained these models on data with explicit text-to-tag information, we assess the classification ability of embedding models that have not been trained on such structured data (and are thus "unsupervised"), in order to gauge their potential applicability to other forums or domains where tag data is not available. We evaluate 14 unsupervised pre-trained models on 0.1% of all Stack Overflow posts against all 61,662 possible Stack Overflow tags. We find that an MPNet model trained partially on unlabelled StackExchange data (i.e., without tag data) achieves the highest score overall for this task, with a recall at 1 (R@1) of 0.161. These results inform which models are most appropriate for XMLC of Stack Overflow posts when supervised training is not feasible. This offers insight into these models' applicability in similar but not identical domains, such as software product forums. These results suggest that training embedding models on in-domain title-body or question-answer pairs can create an effective zero-shot topic classifier for situations where no topic data is available.
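The zero-shot setup the abstract describes — embedding a post and every candidate tag with the same unsupervised model, ranking tags by similarity, and scoring with recall@k — can be sketched as below. This is a minimal illustration with toy hand-made vectors, not the paper's pipeline: in practice the embeddings would come from a pre-trained model (e.g. an MPNet sentence encoder), and the example tag names are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_tags(post_emb, tag_embs):
    # Rank candidate tag indices by similarity to the post embedding,
    # best match first. This is the core zero-shot classification step:
    # no tag-labelled training data is used, only the embeddings.
    sims = [cosine(post_emb, t) for t in tag_embs]
    return sorted(range(len(tag_embs)), key=lambda i: -sims[i])

def recall_at_k(ranked, true_tags, k):
    # R@k: fraction of a post's true tags that appear in the top-k
    # predicted tags (the paper reports R@1 over the full tag set).
    top = set(ranked[:k])
    return sum(1 for t in true_tags if t in top) / len(true_tags)

# Toy 3-dimensional embeddings for 4 candidate tags (illustrative only).
tag_embs = [
    [1.0, 0.0, 0.0],  # e.g. "python"
    [0.0, 1.0, 0.0],  # e.g. "java"
    [0.7, 0.7, 0.0],  # e.g. "jvm"
    [0.0, 0.0, 1.0],  # e.g. "css"
]
post_emb = [0.9, 0.1, 0.0]  # a post whose embedding sits near "python"

ranked = rank_tags(post_emb, tag_embs)
print(ranked[0])                    # index of the top-ranked tag
print(recall_at_k(ranked, [0], 1))  # R@1 when the true tag set is {0}
```

Scaling this sketch to the paper's setting means replacing the toy vectors with model-produced embeddings for each post and each of the 61,662 tags; the ranking and recall computation stay the same.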