Semantically enhanced term frequency based on word embeddings for Arabic information retrieval

2016 4th IEEE International Colloquium on Information Science and Technology (CiSt) | Pub Date: 2016-10-24 | DOI: 10.1109/CIST.2016.7805076

Abdelkader El Mahdaouy, Said Ouatik El Alaoui, Éric Gaussier


Citations: 8

Abstract

Traditional Information Retrieval (IR) models are based on the bag-of-words paradigm, where relevance scores are computed from exact keyword matches. Although these models already achieve good performance, it has been shown that most cases of dissatisfaction with relevance are due to term mismatch between queries and documents. In this paper, we introduce a novel method to compute term frequency based on semantic similarities, using distributed representations of words in a vector space (word embeddings). Our main goal is to allow distinct but semantically related terms to match each other and contribute to the relevance scores. Hence, Arabic documents are retrieved beyond the bag-of-words paradigm, based on semantic similarities between word vectors. Results on standard Arabic TREC data sets show significant improvement over the baseline bag-of-words models.
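The abstract does not give the formula, so the following is only a minimal sketch of the general idea of a similarity-based term frequency, not the authors' method. It assumes that an exact match counts fully and that any other document term contributes its cosine similarity to the query term, provided the similarity reaches a cutoff. The function names `cosine` and `semantic_tf`, the `threshold` parameter, and the toy two-dimensional English vectors are all hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_tf(query_term, doc_terms, vectors, threshold=0.7):
    """Similarity-weighted term frequency (illustrative assumption).

    Exact occurrences of the query term count as 1 each. Any other
    document term counts as its cosine similarity to the query term,
    but only if that similarity is at least `threshold`.
    """
    score = 0.0
    for term, count in Counter(doc_terms).items():
        if term == query_term:
            score += count
        elif term in vectors and query_term in vectors:
            sim = cosine(vectors[query_term], vectors[term])
            if sim >= threshold:
                score += sim * count
    return score


# Toy embeddings invented for this example (not real word vectors).
vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.8, 0.6],  # cosine similarity with "car" is 0.8
    "banana": [0.0, 1.0],      # cosine similarity with "car" is 0.0
}
doc = ["automobile", "automobile", "banana", "car"]

# Exact matching alone would give a term frequency of 1 for "car".
print(semantic_tf("car", doc, vectors))
```

With these toy vectors the two occurrences of "automobile" add about 1.6 to the single exact match, so the related term contributes to the score even though it never matches the query keyword literally. In practice the vectors would come from embeddings trained on an Arabic corpus, and the resulting value would replace the raw term frequency inside a bag-of-words ranking model.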
