Novelty Detection: A Perspective from Natural Language Processing

IF 3.7 | Region 2 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas, Asif Ekbal, P. Bhattacharyya
{"title":"从自然语言处理的角度看新颖性检测","authors":"Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas, Asif Ekbal, P. Bhattacharyya","doi":"10.1162/coli_a_00429","DOIUrl":null,"url":null,"abstract":"The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have assimilated from multiple source documents, not just one. The problem surmounts when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"48 1","pages":"77-117"},"PeriodicalIF":3.7000,"publicationDate":"2021-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Novelty Detection: A Perspective from Natural Language Processing\",\"authors\":\"Tirthankar Ghosal, Tanik Saikh, Tameesh Biswas, Asif Ekbal, P. Bhattacharyya\",\"doi\":\"10.1162/coli_a_00429\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. 
However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have assimilated from multiple source documents, not just one. The problem surmounts when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.\",\"PeriodicalId\":55229,\"journal\":{\"name\":\"Computational Linguistics\",\"volume\":\"48 1\",\"pages\":\"77-117\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2021-12-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Linguistics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1162/coli_a_00429\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00429","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 5

Abstract

The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that has some new information to offer with respect to whatever is earlier seen or known. With the exponential growth of information all across the Web, there is an accompanying menace of redundancy. A considerable portion of the Web contents are duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward because the text may have less lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have assimilated from multiple source documents, not just one. The problem surmounts when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in concern. In this work, we build upon our earlier investigations for document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models to deal with multiple source contexts and present the outcome of our current investigations. We argue that a multipremise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.
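The abstract frames semantic non-novelty as something a (multi-premise) textual entailment model can approximate: a target document is non-novel to the extent that its content is already entailed by previously seen source documents. The sketch below is a minimal illustration of that general idea, not the authors' actual system. It assumes an off-the-shelf NLI model from the HuggingFace Transformers library ("roberta-large-mnli" here), scores each source document against the target one premise at a time, and takes the strongest entailment as a crude single-premise stand-in for the multi-premise setting; the model choice, max-aggregation, and score interpretation are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' system): approximate document-level
# non-novelty as "how strongly is the target entailed by any earlier source?"
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumption: any pre-trained NLI model works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Find the "entailment" class from the model config instead of hardcoding an index.
ENTAIL_ID = next(i for i, lbl in model.config.id2label.items()
                 if lbl.lower() == "entailment")

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                       truncation=True, max_length=512)  # long documents get truncated
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, ENTAIL_ID].item()

def novelty_score(target_doc: str, source_docs: list[str]) -> float:
    """Crude novelty score: 1 minus the strongest entailment of the target by
    any single source document (single-premise approximation)."""
    if not source_docs:
        return 1.0
    max_entail = max(entailment_prob(src, target_doc) for src in source_docs)
    return 1.0 - max_entail

# Usage: a target that merely restates a source should get a low novelty score.
sources = ["The company reported record profits in the last quarter."]
target = "Last quarter, the firm announced its highest profits ever."
print(f"novelty ~= {novelty_score(target, sources):.2f}")  # expected to be low
```

In practice one would have to handle documents longer than the encoder's 512-token window (truncation here simply drops the tail) and would reason over several premises jointly rather than one at a time, which is the gap the multi-premise entailment formulation discussed in the abstract is meant to address.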
Source journal: Computational Linguistics (Engineering & Technology; Computer Science, Interdisciplinary Applications)
CiteScore: 15.80
Self-citation rate: 0.00%
Articles published: 45
Review time: >12 weeks
About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.