Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC

IF 1.1 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS
M. Öztürk
{"title":"Developing a hyperparameter optimization method for classification of code snippets and questions of stack overflow: HyperSCC","authors":"M. Öztürk","doi":"10.4108/eai.27-5-2022.174084","DOIUrl":null,"url":null,"abstract":"Although there exist various machine learning and text mining techniques to identify the programming language of complete code files, multi-label code snippet prediction was not considered by the research community. This work aims at devising a tuner for multi-label programming language prediction of stack overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised along with rule-based automatic labeling by considering the bottlenecks of multi-label classification. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. The method is further compared with the three competitive alternatives in terms of one-label programming language prediction. HyperSCC outperformed the other methods in terms of the F1 score. Preprocessing results in a high reduction (50%) of training time when ensemble multi-label predictors are employed. In one-label programming language prediction, Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts that have a lot of distinctive words determining labels. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.","PeriodicalId":43034,"journal":{"name":"EAI Endorsed Transactions on Scalable Information Systems","volume":"2012 1","pages":"e5"},"PeriodicalIF":1.1000,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"EAI Endorsed Transactions on Scalable Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4108/eai.27-5-2022.174084","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Although there exist various machine learning and text mining techniques to identify the programming language of complete code files, multi-label code snippet prediction was not considered by the research community. This work aims at devising a tuner for multi-label programming language prediction of stack overflow posts. To that end, a Hyper Source Code Classifier (HyperSCC) is devised along with rule-based automatic labeling by considering the bottlenecks of multi-label classification. The proposed method is evaluated on seven multi-label predictors to conduct an extensive analysis. The method is further compared with the three competitive alternatives in terms of one-label programming language prediction. HyperSCC outperformed the other methods in terms of the F1 score. Preprocessing results in a high reduction (50%) of training time when ensemble multi-label predictors are employed. In one-label programming language prediction, Gradient Boosting Machine (gbm) yields the highest accuracy (0.99) in predicting R posts that have a lot of distinctive words determining labels. The findings support the hypothesis that multi-label predictors can be strengthened with sophisticated feature selection and labeling approaches.
开发一种用于代码片段分类和堆栈溢出问题的超参数优化方法:HyperSCC
虽然已有各种机器学习和文本挖掘技术来识别完整代码文件的编程语言,但多标签代码片段预测尚未被研究界考虑。本工作旨在设计一个多标签编程语言预测堆栈溢出帖子的调谐器。为此,考虑到多标签分类的瓶颈,设计了基于规则的自动标注的超源代码分类器(HyperSCC)。提出的方法是评估七个多标签预测进行广泛的分析。在单标签编程语言预测方面,进一步将该方法与三种竞争方案进行了比较。在F1评分方面,HyperSCC优于其他方法。当使用集成多标签预测器时,预处理结果可将训练时间大幅减少(50%)。在单标签编程语言预测中,梯度增强机(Gradient Boosting Machine, gbm)在预测具有许多独特单词决定标签的R帖子时产生了最高的准确率(0.99)。研究结果支持了多标签预测器可以通过复杂的特征选择和标记方法得到加强的假设。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
EAI Endorsed Transactions on Scalable Information Systems
EAI Endorsed Transactions on Scalable Information Systems COMPUTER SCIENCE, INFORMATION SYSTEMS-
CiteScore
2.80
自引率
15.40%
发文量
49
审稿时长
10 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信