Two improvements to detect duplicates in Stack Overflow

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) Pub Date : 2017-02-01 DOI:10.1109/SANER.2017.7884678

Yuji Mizobuchi, K. Takayama

Citations: 12

Abstract

Stack Overflow is one of the most popular question-and-answer sites for programmers. However, it contains a large number of duplicate questions, which should be detected automatically and quickly. In this paper, we introduce two approaches to improve detection accuracy: splitting the question body into different types of data, and using word embeddings to handle word ambiguities that are not covered by general corpora. The evaluation shows that these approaches improve accuracy compared with the traditional method.
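The first idea, splitting the body by data type, can be illustrated with a minimal sketch. The abstract does not say which types the authors separate or how they score similarity, so everything here is an assumption made for illustration: the split is into prose and `<pre><code>` blocks, the score is a plain Jaccard overlap of tokens, and the names `split_body`, `jaccard`, and `similarity`, along with the equal weights, are hypothetical.

```python
import re
from html import unescape

# Assumption: code snippets appear as <pre><code>...</code></pre> in the post HTML.
CODE_BLOCK = re.compile(
    r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>",
    re.DOTALL | re.IGNORECASE,
)
TAG = re.compile(r"<[^>]+>")


def split_body(body_html):
    """Return (prose, code) extracted from a post body given as HTML."""
    code_parts = [unescape(part) for part in CODE_BLOCK.findall(body_html)]
    # Remove the code blocks first, then strip the remaining tags from the prose.
    prose_html = CODE_BLOCK.sub(" ", body_html)
    prose = unescape(TAG.sub(" ", prose_html))
    return " ".join(prose.split()), "\n".join(code_parts)


def tokens(text):
    """Lowercased identifier-like tokens, as a set."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of the token sets of two strings (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similarity(body_a, body_b, w_prose=0.5, w_code=0.5):
    """Weighted sum of per-type similarities (weights are illustrative only)."""
    prose_a, code_a = split_body(body_a)
    prose_b, code_b = split_body(body_b)
    return w_prose * jaccard(prose_a, prose_b) + w_code * jaccard(code_a, code_b)


if __name__ == "__main__":
    q1 = "<p>How do I sort a list?</p><pre><code>xs.sort()</code></pre>"
    q2 = "<p>How can I sort a list in place?</p><pre><code>xs.sort()</code></pre>"
    print(split_body(q1))
    print(similarity(q1, q2))
```

Scoring each type separately keeps code tokens from being diluted by surrounding prose, and the reverse. The second idea in the abstract would replace the exact-match token overlap on the prose side with a comparison of word embeddings, presumably so that domain-specific terms are matched by meaning rather than by spelling.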
