Algorithm of the Text Copy Detection Based on Topic Bag

2010 International Conference on Web Information Systems and Mining. Pub Date: 2010-10-23. DOI: 10.1109/WISM.2010.159

Sen Wang, Yu Wang

Citations: 0

Abstract

To address the serious problem of academic plagiarism on the web, this paper proposes a text copy detection algorithm based on topic bags, drawing on ideas from semantic clustering and multi-instance learning. First, a paper is represented as a three-layer tree: each leaf node denotes a sentence; each branch node represents a topic bag, formed by semantically clustering several paragraphs; and the root node is the whole text. Second, the similarity between topic bags is computed from the similarities of their sentences, and the similarity of two papers is then obtained from the similarities and weights of their topic bags. Experiments show that the proposed algorithm achieves higher accuracy.
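The abstract gives no formulas, so the following is only a rough sketch of how the bottom-up scoring might look, not the authors' method. It assumes the topic bags have already been produced by some clustering step and are passed in as lists of sentences. The Jaccard word overlap for sentences, the best-match averaging for bags, and the sentence-count weights are all illustrative assumptions, as are the function names.

```python
import re


def tokens(sentence):
    """Return the set of lowercased alphanumeric words in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def sentence_similarity(s1, s2):
    """Jaccard overlap of word sets (assumed measure, not from the paper)."""
    a, b = tokens(s1), tokens(s2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def bag_similarity(bag_a, bag_b):
    """Average, over sentences in bag_a, of the best match found in bag_b.

    Taking the best-matching instance per sentence is one multi-instance
    style choice; the paper may aggregate differently.
    """
    if not bag_a or not bag_b:
        return 0.0
    best = [max(sentence_similarity(s, t) for t in bag_b) for s in bag_a]
    return sum(best) / len(best)


def paper_similarity(paper_a, paper_b):
    """Weighted sum of each bag's best match in the other paper.

    A paper is a list of topic bags; a bag is a list of sentences.
    Weight = the bag's share of paper_a's sentences (assumed weighting).
    """
    total = sum(len(bag) for bag in paper_a)
    if total == 0 or not paper_b:
        return 0.0
    score = 0.0
    for bag in paper_a:
        if not bag:
            continue
        weight = len(bag) / total
        score += weight * max(bag_similarity(bag, other) for other in paper_b)
    return score


if __name__ == "__main__":
    paper_a = [
        ["Topic bags group related sentences.", "Each bag covers one topic."],
        ["Weights reflect bag size."],
    ]
    paper_b = [["Completely unrelated words here."]]
    print(paper_similarity(paper_a, paper_a))
    print(paper_similarity(paper_a, paper_b))
```

Under these assumptions a paper compared with itself scores 1 (up to floating-point rounding) and two papers with no words in common score 0. The score is directional, measuring how much of the first paper is covered by the second, which suits a check of a suspect paper against a source.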



