Algorithm of the Text Copy Detection Based on Topic Bag

2010 International Conference on Web Information Systems and Mining. Pub Date: 2010-10-23. DOI: 10.1109/WISM.2010.159

Sen Wang, Yu Wang


Citations: 0


Algorithm of the Text Copy Detection Based on Topic Bag

To address the serious problem of academic plagiarism on the web, this paper proposes a text copy detection algorithm based on topic bags, drawing on the ideas of semantic clustering and multi-instance learning. First, a paper is represented as a three-layer tree: each leaf node denotes a sentence; each branch node represents a topic bag, formed by semantic clustering of several paragraphs; and the root node is the whole text. Second, the similarity between topic bags is calculated from the similarities of their sentences, and the similarity of two papers is then obtained from the similarities and weights of their topic bags. Experiments show that the proposed algorithm achieves higher accuracy.
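The abstract does not give the formulas involved, so the following Python sketch only illustrates the overall shape of the three-layer scheme (sentence, topic bag, text). Everything specific in it is an assumption made for illustration and not the authors' method: topic bags are taken as already formed (the semantic clustering step is not implemented), sentence similarity is a simple Jaccard word overlap, a bag is compared to another by averaging each sentence's best match, and a bag's weight is its share of the document's sentences. The names `TopicBag`, `Document`, `sentence_similarity`, `bag_similarity`, and `document_similarity` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TopicBag:
    """Branch node: sentences grouped under one topic (grouping assumed given)."""
    sentences: List[str]


@dataclass
class Document:
    """Root node: a text made up of topic bags."""
    bags: List[TopicBag]


def sentence_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed stand-in measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def bag_similarity(x: TopicBag, y: TopicBag) -> float:
    """Mean, over the sentences of x, of the best-matching sentence in y."""
    if not x.sentences or not y.sentences:
        return 0.0
    best = [max(sentence_similarity(s, t) for t in y.sentences)
            for s in x.sentences]
    return sum(best) / len(best)


def document_similarity(p: Document, q: Document) -> float:
    """Weighted sum of each bag's best match in q.

    Assumed weighting: a bag's weight is its share of p's sentences.
    """
    total = sum(len(bag.sentences) for bag in p.bags)
    if total == 0 or not q.bags:
        return 0.0
    score = 0.0
    for bag in p.bags:
        weight = len(bag.sentences) / total
        score += weight * max(bag_similarity(bag, other) for other in q.bags)
    return score


# Toy usage: the suspect text copies one sentence from the original.
original = Document([
    TopicBag(["topic bags group related sentences", "clustering is semantic"]),
    TopicBag(["experiments report accuracy"]),
])
suspect = Document([
    TopicBag(["topic bags group related sentences"]),
    TopicBag(["the weather was pleasant"]),
])
print(document_similarity(original, suspect))
```

Under these assumptions the score is asymmetric: `document_similarity(p, q)` measures how much of `p` is covered by `q`, which may differ from the reverse direction. A real system would replace the word-overlap measure with a semantic one and derive the bags by clustering.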
