Detecting duplicates with shallow and parser-based methods

Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010). Pub Date: 2010-09-30. DOI: 10.1109/NLPKE.2010.5587838

Sven Hartrumpf, Tim vor der Brück, Christian Eichhorn


Citations: 4

Abstract

Identifying duplicate texts is important in many areas, such as plagiarism detection, information retrieval, text summarization, and question answering. Current approaches are mostly surface-oriented (or use only shallow syntactic representations) and treat each text merely as a list of tokens. In this work, however, we describe a deep, semantically oriented method based on semantic networks that are derived by a syntactico-semantic parser. For each sentence of a given base text, semantically identical or similar semantic networks are retrieved efficiently by using a specialized semantic network index. In order to detect many kinds of paraphrases, the current base semantic network is varied by applying inferences: lexico-semantic relations, relation axioms, and meaning postulates. Some important phenomena occurring in difficult-to-detect duplicates are discussed. The deep approach benefits from background knowledge, whose acquisition from corpora like Wikipedia is explained briefly. This deep duplicate recognizer is combined with two shallow duplicate recognizers in order to guarantee high recall for texts that are not fully parsable. The evaluation shows that the combined approach preserves recall and increases precision considerably in comparison to traditional shallow methods. For the evaluation, a standard corpus of German plagiarisms was extended by four diverse components with an emphasis on duplicates (and not just plagiarisms), e.g., news feed articles from different web sources and two translations of the same short story.
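
The abstract does not specify the two shallow recognizers or how the recognizers are combined, so the following is only an illustrative sketch under stated assumptions. It uses word n-gram (shingle) Jaccard overlap as a stand-in for a token-list method, and a hypothetical `deep_score` callback that returns `None` when a text cannot be parsed, in which case the shallow score is used as a fallback. The function names, thresholds, and the fallback rule are all assumptions, not the authors' implementation.

```python
from typing import Callable, Optional, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of lowercased word n-grams of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Texts shorter than n words yield one shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Shallow similarity: Jaccard overlap of the two shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def is_duplicate(
    a: str,
    b: str,
    deep_score: Optional[Callable[[str, str], Optional[float]]] = None,
    shallow_threshold: float = 0.5,
    deep_threshold: float = 0.5,
) -> bool:
    """Assumed combination rule: trust the deep score when a parse exists,
    otherwise fall back to the shallow score so unparsable texts are
    still compared."""
    if deep_score is not None:
        score = deep_score(a, b)  # hypothetical; None means "not parsable"
        if score is not None:
            return score >= deep_threshold
    return jaccard(a, b) >= shallow_threshold
```

With this rule, a paraphrase that shares few word sequences with its source can still be flagged when the deep recognizer scores it highly, while texts that fail to parse are judged on surface overlap alone.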

