Semantically enhanced term frequency based on word embeddings for Arabic information retrieval

Abdelkader El Mahdaouy, Said Ouatik El Alaoui, Éric Gaussier
{"title":"Semantically enhanced term frequency based on word embeddings for Arabic information retrieval","authors":"Abdelkader El Mahdaouy, Said Ouatik El Alaoui, Éric Gaussier","doi":"10.1109/CIST.2016.7805076","DOIUrl":null,"url":null,"abstract":"Traditional Information Retrieval (IR) models are based on bag-of-words paradigm, where relevance scores are computed based on exact matching of keywords. Although these models have already achieved good performance, it has been shown that most of dissatisfaction cases in relevance are due to term mismatch between queries and documents. In this paper, we introduce novel method to compute term frequency based on semantic similarities using distributed representations of words in a vector space (Word Embeddings). Our main goal is to allow distinct but semantically related terms to match each other and contribute to the relevance scores. Hence, Arabic documents are retrieved beyond the bag-of-words paradigm based on semantic similarities between word vectors. The results on Arabic standard TREC data sets show significant improvement over the baseline bag-of-words models.","PeriodicalId":196827,"journal":{"name":"2016 4th IEEE International Colloquium on Information Science and Technology (CiSt)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 4th IEEE International Colloquium on Information Science and Technology (CiSt)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIST.2016.7805076","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Traditional Information Retrieval (IR) models are based on bag-of-words paradigm, where relevance scores are computed based on exact matching of keywords. Although these models have already achieved good performance, it has been shown that most of dissatisfaction cases in relevance are due to term mismatch between queries and documents. In this paper, we introduce novel method to compute term frequency based on semantic similarities using distributed representations of words in a vector space (Word Embeddings). Our main goal is to allow distinct but semantically related terms to match each other and contribute to the relevance scores. Hence, Arabic documents are retrieved beyond the bag-of-words paradigm based on semantic similarities between word vectors. The results on Arabic standard TREC data sets show significant improvement over the baseline bag-of-words models.
基于词嵌入的词频语义增强阿拉伯语信息检索
传统的信息检索(Information Retrieval, IR)模型是基于词袋模型(bag-of-words paradigm),基于关键词的精确匹配计算相关分数。尽管这些模型已经取得了良好的性能,但研究表明,大多数相关性不满意的情况是由于查询和文档之间的术语不匹配造成的。在本文中,我们引入了一种基于语义相似度计算词频的新方法,该方法使用向量空间中词的分布式表示(词嵌入)来计算词频。我们的主要目标是允许不同但语义相关的术语相互匹配,并贡献相关分数。因此,阿拉伯语文档的检索超越了基于词向量之间语义相似性的词袋范式。在阿拉伯语标准TREC数据集上的结果显示,与基线词袋模型相比,有显著的改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信