面向信息检索的段落向量模型分析

Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft
{"title":"面向信息检索的段落向量模型分析","authors":"Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft","doi":"10.1145/2970398.2970409","DOIUrl":null,"url":null,"abstract":"Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.","PeriodicalId":443715,"journal":{"name":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"79","resultStr":"{\"title\":\"Analysis of the Paragraph Vector Model for Information Retrieval\",\"authors\":\"Qingyao Ai, Liu Yang, Jiafeng Guo, W. Bruce Croft\",\"doi\":\"10.1145/2970398.2970409\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.\",\"PeriodicalId\":443715,\"journal\":{\"name\":\"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval\",\"volume\":\"43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"79\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2970398.2970409\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2970398.2970409","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 79

摘要

以往的研究表明,通过神经嵌入模型可以获得词和文本的语义意义表征。特别是段落向量(PV)模型通过估计文档(主题)级语言模型在一些自然语言处理任务中表现出令人印象深刻的性能。然而,将PV模型与传统的语言模型方法集成在一起进行检索,会产生不稳定的性能和有限的改进。在本文中,我们正式讨论了原PV模型在检索任务中限制其性能的三个内在问题。我们还描述了对模型的修改,使其更适合红外任务,并通过实验和案例研究展示了它们的影响。我们解决的三个问题是:(1)不规范的PV训练过程容易产生短文档过拟合,从而在最终的检索模型中产生长度偏差;(2)基于语料库的PV负抽样导致单词权重方案过度抑制了频繁词的重要性;(3)由于缺乏词-上下文信息,使得PV无法捕捉词替换关系。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Analysis of the Paragraph Vector Model for Information Retrieval
Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信