Multi-word terms selection for information retrieval
Authors: Chedi Bechikh Ali, Hatem Haddad, Y. Slimani
Journal: Information Discovery and Delivery (impact factor 2.1; JCR Q2, Information Science & Library Science)
Publication date: 2022-06-28
Publication type: Journal Article
DOI: https://doi.org/10.1108/idd-12-2021-0142
Citations: 2
Abstract
Purpose
A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from poor precision at low recall. The choice of indexing units has a great impact on search system effectiveness. The authors go beyond simple-term indexing to propose a framework for multi-word term (MWT) filtering and indexing.
Design/methodology/approach
In this paper, the authors rank MWT in order to filter them, keeping the most effective ones for the indexing process. The proposed model filters MWT according to their ability to capture the document topic and to distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the highest degree of association. The experiments are carried out on English- and French-language data sets.
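As a rough illustration of ranking and filtering MWT by degree of association, the sketch below scores adjacent word pairs with pointwise mutual information (PMI) and keeps the top-ranked ones. PMI is an assumption chosen for illustration; the abstract does not name the association measures the authors tested. The function names, the restriction to two-word terms, and the `min_count` and `top_k` parameters are likewise hypothetical.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(docs, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    docs: list of token lists, one per document.
    min_count: hypothetical frequency cut-off; rarer pairs are skipped
    because PMI is unreliable for low counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI: log ratio of the observed pair probability to the
        # probability expected if the two words were independent.
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select_top_mwt(docs, top_k=10, min_count=2):
    """Keep only the top_k highest-scoring pairs as indexing units."""
    ranked = rank_bigrams_by_pmi(docs, min_count)
    return [pair for pair, _ in ranked[:top_k]]
```

Any other association measure could be swapped in by replacing the scoring line; the ranking and filtering steps would stay the same.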
Findings
The results indicate that this approach improved precision at low recall and performed better than more advanced models based on term dependencies.
Originality/value
The study uses and tests different association measures to select the MWT that best describe the documents, in order to enhance precision among the first retrieved documents.
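Precision among the first retrieved documents is commonly measured as precision at rank k. The sketch below shows that standard computation; the abstract does not state which cut-offs or evaluation tools the authors used, so the function name and the example identifiers are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k retrieved documents that are relevant.

    ranked_doc_ids: document identifiers in retrieval order.
    relevant_doc_ids: set of identifiers judged relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k (not by len(top_k)) so short result lists are penalised.
    return hits / k


# Hypothetical run: two of the first five results are relevant.
run = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(run, relevant, 5))  # 0.4
```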
About the journal:
Information Discovery and Delivery covers information discovery and access for digital information researchers. This includes educators, knowledge professionals in education and cultural organisations, knowledge managers in media, health care and government, as well as librarians. The journal publishes research and practice that explore the digital information supply chain, i.e. transport, flows, tracking, exchange and sharing, including within and between libraries. It is also interested in digital information capture, packaging and storage by ‘collectors’ of all kinds. Information is broadly defined, including but not limited to: records, documents, learning objects, visual and sound files, data and metadata, and user-generated content.