Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC

Impact factor: 0.8; JCR Q4; Computer Science, Information Systems

EAI Endorsed Transactions on Scalable Information Systems. Pub Date: 2022-05-27. DOI: 10.4108/eai.27-5-2022.174084

M. Öztürk



Abstract

Although various machine learning and text mining techniques exist for identifying the programming language of complete code files, multi-label prediction for code snippets has not been considered by the research community. This work aims to devise a tuner for multi-label programming language prediction of Stack Overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised, along with rule-based automatic labeling, by considering the bottlenecks of multi-label classification. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. It is further compared with three competitive alternatives in one-label programming language prediction. HyperSCC outperformed the other methods in terms of F1 score. Preprocessing reduces training time substantially (by 50%) when ensemble multi-label predictors are employed. In one-label programming language prediction, the Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts that have many distinctive words determining their labels. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.
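To make the terms "rule-based automatic labeling" and "F1 score" concrete, the following is a minimal sketch, not the HyperSCC implementation. The abstract does not give the paper's labeling rules, so the regular expressions in `RULES`, the function names `label_post` and `micro_f1`, and the use of micro-averaging are all assumptions made for illustration. The hyperparameter tuner itself is not shown.

```python
import re

# Hypothetical keyword rules (assumption): the paper's actual labeling
# rules are not stated in the abstract.
RULES = {
    "python": [r"\bdef\s+\w+\s*\(", r"\bimport\s+numpy\b"],
    "r": [r"<-", r"\blibrary\(", r"\bdata\.frame\("],
    "java": [r"\bpublic\s+static\b", r"System\.out\.println"],
}


def label_post(text):
    """Return the set of languages with at least one matching rule.

    A post can receive several labels at once, which is what makes the
    task multi-label rather than one-label.
    """
    return {
        lang
        for lang, patterns in RULES.items()
        if any(re.search(p, text) for p in patterns)
    }


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over label sets: 2*TP / (2*TP + FP + FN)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    posts = [
        "x <- data.frame(a = 1)",
        "def f(x):\n    return x",
        "public static void main(String[] a) {}  // then in R: y <- 2",
    ]
    for post in posts:
        print(sorted(label_post(post)))
```

Labels produced this way could serve as training targets for a multi-label predictor, and a score such as `micro_f1` could serve as the objective that a tuner maximizes. Whether the paper uses micro-averaged, macro-averaged, or per-label F1 is not stated in the abstract.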


