A monolingual approach to detection of text reuse in Russian-English collection

2015 Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT) Pub Date: 2015-11-01 DOI: 10.1109/AINL-ISMW-FRUCT.2015.7382960

O. Bakhteev, Rita Kuznetsova, A. Romanov, A. Khritankov

Citations: 2

Abstract

In this paper we develop a method for cross-lingual (Russian-English) text reuse detection. The method follows the monolingual approach: texts are translated into one language, which reduces the task to a text similarity problem. We split texts into non-overlapping fragments and compare the fragments with one another using several metrics: BLEU(1-2), METEOR, cosine similarity between bag-of-words representations of the fragments, and cosine similarity between vectors obtained from a trained doc2vec model. We explore how the choice of metric affects the quality of text reuse detection. We assess the quality of the method on a sample of one hundred scientific documents, originally written in Russian and machine-translated into English. Preliminary findings demonstrate the feasibility of the approach.
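The fragment comparison described in the abstract can be illustrated with a minimal sketch. It assumes both texts are already in English and tokenized, and it covers only two of the listed metrics: bag-of-words cosine similarity and clipped n-gram precision, the building block of BLEU. METEOR and doc2vec are left out because they need external resources. The function names, the fragment size, and the threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def split_into_fragments(tokens, size):
    """Split a token list into consecutive, non-overlapping fragments."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors of two fragments."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision (BLEU building block, no brevity penalty)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def detect_reuse(suspicious_tokens, source_tokens, size=8, threshold=0.5):
    """Pair each suspicious fragment with its most similar source fragment.

    `size` and `threshold` are assumed values, not parameters from the paper.
    Returns (suspicious_index, source_index, score) for pairs at or above
    the threshold.
    """
    source_fragments = split_into_fragments(source_tokens, size)
    pairs = []
    for i, fragment in enumerate(split_into_fragments(suspicious_tokens, size)):
        scored = [(cosine_bow(fragment, s), j) for j, s in enumerate(source_fragments)]
        if not scored:
            continue
        score, j = max(scored)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    translated = "the method is based on translation of texts into one language".split()
    source = "our method is based on translation of the texts into a single language".split()
    for i, j, score in detect_reuse(translated, source, size=6, threshold=0.3):
        print(f"suspicious fragment {i} ~ source fragment {j}: cosine={score:.2f}")
```

Swapping `cosine_bow` for `ngram_precision` (with `n` set to 1 or 2) in `detect_reuse` gives a rough way to see how the choice of metric changes which fragment pairs are flagged.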
