LoGE: an unsupervised local-global document extension generation in information retrieval for long documents

IF 2.3 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

International Journal of Web Information Systems. Pub Date: 2023-09-08. DOI: 10.1108/ijwis-07-2023-0109

Oussama Ayoub, Christophe Rodrigues, Nicolas Travers


Citations: 0

Abstract

Purpose
This paper addresses the word gap in information retrieval (IR), especially for long documents belonging to specific domains. With the continuous growth of the text data that modern IR systems have to manage, solutions are needed to efficiently find the best set of documents for a given request. The words used to express a query can differ from those used in the related documents. Even when they are close in meaning, non-overlapping words are challenging for IR systems. This word gap becomes significant for long documents from specific domains.

Design/methodology/approach
To generate new words for a document, a deep learning (DL) masked language model is used to infer related words. The DL models used are pretrained on massive text data and carry common or domain-specific knowledge, which yields a better document representation.

Findings
The authors evaluate the approach on specific IR domains with long documents to show the genericity of the proposed model, and they achieve encouraging results.

Originality/value
To the best of the authors' knowledge, this paper introduces an original unsupervised and modular IR system based on recent DL methods.
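The idea of inferring related words with a masked language model can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how documents are masked or how candidates are combined, so the prompt template, the fixed-size chunking, the score summation and all names below (`expand_document`, `split_into_chunks`, `toy_predictor`) are assumptions made for illustration. The predictor is a hypothetical stand-in so that the example runs without a pretrained model.

```python
from collections import Counter
from typing import Callable, List, Tuple

MASK = "[MASK]"

# Hypothetical interface: takes text containing MASK and returns
# (word, score) candidates. A real system would wrap a pretrained
# masked language model behind this signature.
Predictor = Callable[[str], List[Tuple[str, float]]]


def split_into_chunks(text: str, size: int) -> List[str]:
    """Split a long document into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def expand_document(text: str, predict: Predictor,
                    chunk_size: int = 50, top_k: int = 5) -> List[str]:
    """Return up to `top_k` new words inferred for the document.

    Assumed scheme: each chunk gets a prompt ending in a mask token,
    candidates are scored per chunk, and scores are summed over chunks.
    """
    existing = {w.lower().strip(".,;:") for w in text.split()}
    totals: Counter = Counter()
    for chunk in split_into_chunks(text, chunk_size):
        prompt = f"{chunk} This text is about {MASK}."
        for word, score in predict(prompt):
            if word.lower() not in existing:  # keep only words that add vocabulary
                totals[word.lower()] += score
    return [word for word, _ in totals.most_common(top_k)]


def toy_predictor(prompt: str) -> List[Tuple[str, float]]:
    """Stand-in for a masked language model, for illustration only."""
    lowered = prompt.lower()
    if "court" in lowered or "plaintiff" in lowered:
        return [("law", 0.6), ("litigation", 0.3), ("court", 0.1)]
    return [("science", 0.5), ("research", 0.5)]


document = "The plaintiff filed an appeal. The court dismissed the claim."
expansion = expand_document(document, toy_predictor, chunk_size=5)
print(expansion)
```

In this toy run the words added to the document are ones it does not already contain ("court" is filtered out), which is the point of expansion: a query that uses "litigation" could then match a document that never uses that word.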




Source journal

International Journal of Web Information Systems (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

4.60

Self-citation rate

0.00%

Publication count

Journal description: The Global Information Infrastructure is a daily reality. In spite of its many applications in all domains of our societies (e-business, e-commerce, e-learning, e-science and e-government, for instance) and in spite of the tremendous advances by engineers and scientists, the seamless development of Web information systems and services remains a major challenge. The journal examines how the current shared vision for the future is one of semantically rich information and service-oriented architecture for global information systems. This vision is at the convergence of progress in technologies such as XML, Web services, RDF and OWL, in multimedia, multimodal and multilingual information retrieval, and in distributed, mobile and ubiquitous computing.

Topicality: While the International Journal of Web Information Systems covers a broad range of topics, the journal welcomes papers that provide a perspective on all aspects of Web information systems: Web semantics and Web dynamics, Web mining and searching, Web databases and Web data integration, Web-based commerce and e-business, Web collaboration and distributed computing, Internet computing and networks, performance of Web applications, and Web multimedia services and Web-based education.