Question Similarity Detection on Stack Overflow Sites

2022 XVLIII Latin American Computer Conference (CLEI), Pub Date: 2022-10-17, DOI: 10.1109/CLEI56649.2022.9959915

M. Botto-Tobar

Abstract

Community-Based Question Answering (CQA) has grown in popularity as a way for people from all backgrounds to share information and knowledge. Stack Overflow is a widely used CQA website that focuses on programming problems and questions. Many of the questions posted on Stack Overflow have already been answered. However, two questions that ask the same thing can differ greatly in vocabulary and grammatical structure, which makes their semantic equivalence difficult to determine. Automatic duplicate detection saves moderators time before they take action and helps askers find solutions quickly. Finding a similar question across two websites in two different languages is harder still. The proposed approach therefore focuses on similarity detection across the English and Spanish Stack Overflow sites (SO and SO-ES). It prepares labeled data by collecting questions from both sites and labeling them manually, and it uses the Synthetic Minority Oversampling Technique (SMOTE), a data augmentation technique, to balance the data. The work applies machine learning techniques such as neural networks, Word Mover's Distance (WMD), and Logistic Regression to detect similar questions on the SO and SO-ES sites. The models are evaluated using standard metrics such as the confusion matrix, accuracy, and recall. Logistic Regression outperforms the other three algorithms in terms of accuracy, while WMD performs well in terms of recall.
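
The abstract reports that one method leads on accuracy while another does well on recall. The sketch below illustrates how the two metrics can diverge on imbalanced data. It is a minimal illustration, not the paper's evaluation code: the function names and the toy labels and predictions are hypothetical and are not results from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 = similar pair."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all question pairs that were classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly similar pairs that the classifier found."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data (NOT from the paper): 3 similar pairs, 7 dissimilar.
Y_TRUE = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A cautious classifier that rarely predicts "similar".
PRED_CONSERVATIVE = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# An eager classifier that predicts "similar" often.
PRED_EAGER = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    for name, pred in [("conservative", PRED_CONSERVATIVE), ("eager", PRED_EAGER)]:
        print(name, confusion_counts(Y_TRUE, pred),
              round(accuracy(Y_TRUE, pred), 3), round(recall(Y_TRUE, pred), 3))
```

In this toy setup the cautious classifier scores higher on accuracy and the eager one scores higher on recall, which is why both metrics are usually reported together for duplicate detection.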
