在堆栈溢出中检测重复项的两个改进

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) Pub Date : 2017-02-01 DOI:10.1109/SANER.2017.7884678

Yuji Mizobuchi, K. Takayama

{"title":"在堆栈溢出中检测重复项的两个改进","authors":"Yuji Mizobuchi, K. Takayama","doi":"10.1109/SANER.2017.7884678","DOIUrl":null,"url":null,"abstract":"Stack Overflow is one of the most popular question-and-answer sites for programmers. However, there are a great number of duplicate questions that are expected to be detected automatically in a short time. In this paper, we introduce two approaches to improve the detection accuracy: splitting body into different types of data and using word-embedding to treat word ambiguities that are not contained in the general corpuses. The evaluation shows that these approaches improve the accuracy compared with the traditional method.","PeriodicalId":6541,"journal":{"name":"2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)","volume":"52 1","pages":"563-564"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Two improvements to detect duplicates in Stack Overflow\",\"authors\":\"Yuji Mizobuchi, K. Takayama\",\"doi\":\"10.1109/SANER.2017.7884678\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stack Overflow is one of the most popular question-and-answer sites for programmers. However, there are a great number of duplicate questions that are expected to be detected automatically in a short time. In this paper, we introduce two approaches to improve the detection accuracy: splitting body into different types of data and using word-embedding to treat word ambiguities that are not contained in the general corpuses. The evaluation shows that these approaches improve the accuracy compared with the traditional method.\",\"PeriodicalId\":6541,\"journal\":{\"name\":\"2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)\",\"volume\":\"52 1\",\"pages\":\"563-564\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SANER.2017.7884678\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SANER.2017.7884678","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

Stack Overflow是最受程序员欢迎的问答网站之一。然而，有大量的重复问题需要在短时间内被自动检测出来。本文介绍了两种提高检测精度的方法:将正文分成不同类型的数据和使用词嵌入来处理一般语料库中不包含的词歧义。评价表明，与传统方法相比，这些方法提高了精度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Two improvements to detect duplicates in Stack Overflow

Stack Overflow is one of the most popular question-and-answer sites for programmers. However, there are a great number of duplicate questions that are expected to be detected automatically in a short time. In this paper, we introduce two approaches to improve the detection accuracy: splitting body into different types of data and using word-embedding to treat word ambiguities that are not contained in the general corpuses. The evaluation shows that these approaches improve the accuracy compared with the traditional method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)

自引率

0.00%

发文量