Oussama Ayoub, Christophe Rodrigues, Nicolas Travers
{"title":"LoGE: an unsupervised local-global document extension generation in information retrieval for long documents","authors":"Oussama Ayoub, Christophe Rodrigues, Nicolas Travers","doi":"10.1108/ijwis-07-2023-0109","DOIUrl":null,"url":null,"abstract":"\nPurpose\nThis paper aims to manage the word gap in information retrieval (IR) especially for long documents belonging to specific domains. In fact, with the continuous growth of text data that modern IR systems have to manage, existing solutions are needed to efficiently find the best set of documents for a given request. The words used to describe a query can differ from those used in related documents. Despite meaning closeness, nonoverlapping words are challenging for IR systems. This word gap becomes significant for long documents from specific domains.\n\n\nDesign/methodology/approach\nTo generate new words for a document, a deep learning (DL) masked language model is used to infer related words. Used DL models are pretrained on massive text data and carry common or specific domain knowledge to propose a better document representation.\n\n\nFindings\nThe authors evaluate the approach of this study on specific IR domains with long documents to show the genericity of the proposed model and achieve encouraging results.\n\n\nOriginality/value\nIn this paper, to the best of the authors’ knowledge, an original unsupervised and modular IR system based on recent DL methods is introduced.\n","PeriodicalId":44153,"journal":{"name":"International Journal of Web Information Systems","volume":" ","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Web Information Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1108/ijwis-07-2023-0109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Purpose
This paper aims to manage the word gap in information retrieval (IR) especially for long documents belonging to specific domains. In fact, with the continuous growth of text data that modern IR systems have to manage, existing solutions are needed to efficiently find the best set of documents for a given request. The words used to describe a query can differ from those used in related documents. Despite meaning closeness, nonoverlapping words are challenging for IR systems. This word gap becomes significant for long documents from specific domains.
Design/methodology/approach
To generate new words for a document, a deep learning (DL) masked language model is used to infer related words. Used DL models are pretrained on massive text data and carry common or specific domain knowledge to propose a better document representation.
Findings
The authors evaluate the approach of this study on specific IR domains with long documents to show the genericity of the proposed model and achieve encouraging results.
Originality/value
In this paper, to the best of the authors’ knowledge, an original unsupervised and modular IR system based on recent DL methods is introduced.
期刊介绍:
The Global Information Infrastructure is a daily reality. In spite of the many applications in all domains of our societies: e-business, e-commerce, e-learning, e-science, and e-government, for instance, and in spite of the tremendous advances by engineers and scientists, the seamless development of Web information systems and services remains a major challenge. The journal examines how current shared vision for the future is one of semantically-rich information and service oriented architecture for global information systems. This vision is at the convergence of progress in technologies such as XML, Web services, RDF, OWL, of multimedia, multimodal, and multilingual information retrieval, and of distributed, mobile and ubiquitous computing. Topicality While the International Journal of Web Information Systems covers a broad range of topics, the journal welcomes papers that provide a perspective on all aspects of Web information systems: Web semantics and Web dynamics, Web mining and searching, Web databases and Web data integration, Web-based commerce and e-business, Web collaboration and distributed computing, Internet computing and networks, performance of Web applications, and Web multimedia services and Web-based education.