Synonym Insensitive Searching: A Novel Synonym Weighted-Vector Space Model for Document Retrieval

Mumthaz Beegum M, A. S, Raveena Vijayan
Published in: 2023 2nd International Conference on Computational Systems and Communication (ICCSC)
Publication date: 2023-03-03
DOI: 10.1109/ICCSC56913.2023.10142977 (https://doi.org/10.1109/ICCSC56913.2023.10142977)

Abstract

Document retrieval becomes challenging because natural languages can express the same content in different forms through synonyms, usages, and their complex combinations. Most existing information retrieval systems struggle to retrieve documents with a similar meaning and are useful only for retrieving documents that match the query keywords. Against this background, query expansion is a logically simple and straightforward technique for improving retrieval effectiveness. Existing statistical approaches depend mainly on term frequency to generate candidate documents for the expanded or original query. Most existing works do not consider the ways in which the content of a document can be expressed differently while preserving the same context. This paper proposes a novel Synonym Weighted-Vector Space Model (VSM) and a query expansion technique that together form an effective synonym-incorporated method for document retrieval. The combination of a modified Term Frequency-Inverse Document Frequency (TF-IDF) weighting and the synonym-extended VSM gave promising results throughout the experiments. The proposed method is validated on two publicly available English-language datasets, CACM and CISI. The quantitative measures obtained in the experiments (mean average precision, precision, recall, and F-measure) are better for the proposed method than for the classical VSM and other baseline methods in the problem domain. The highest precision obtained was 0.83 for the CACM dataset and 0.65 for the CISI dataset.
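To make the general idea concrete, the following is a minimal sketch of synonym-weighted query expansion over a TF-IDF vector space model. It is not the authors' method: the abstract does not specify the synonym source, the weighting scheme, or the TF-IDF modification, so the synonym table `SYNONYMS`, the discount `SYNONYM_WEIGHT`, the smoothed IDF formula, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical hand-made synonym table (the paper's synonym source is not given).
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "fast": {"quick"},
    "quick": {"fast"},
}
SYNONYM_WEIGHT = 0.5  # assumed discount for a synonym relative to the original term


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would also stem and drop stop words."""
    return text.lower().split()


def build_index(docs):
    """Return TF-IDF document vectors (term -> weight dicts) and the IDF table."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed IDF (an assumption, not the paper's modified TF-IDF).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]
    return vectors, idf


def expand_query(tokens):
    """Weight each query term 1.0 and add each of its synonyms at SYNONYM_WEIGHT."""
    weights = Counter()
    for tok in tokens:
        weights[tok] += 1.0
        for syn in SYNONYMS.get(tok, ()):
            weights[syn] += SYNONYM_WEIGHT
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, vectors, idf, expand=True):
    """Rank documents by cosine similarity; returns (score, doc_index) pairs, best first."""
    tokens = tokenize(query)
    weights = expand_query(tokens) if expand else Counter(tokens)
    q_vec = {t: w * idf[t] for t, w in weights.items() if t in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "the car is fast",
    "an automobile can be quick",
    "bananas are yellow",
]
vectors, idf = build_index(docs)
for score, idx in search("fast car", vectors, idf):
    print(f"{score:.3f}  {docs[idx]}")
```

In this toy corpus, the second document shares no keyword with the query "fast car", so a plain keyword-matching VSM (`expand=False`) scores it zero. With expansion it receives a positive score that is still lower than that of the exact-match document, which is the behaviour the synonym discount is meant to produce.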