Bagging to find better expansion words

Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010). Pub Date: 2010-09-30. DOI: 10.1109/NLPKE.2010.5587826

Bingqing Wang, Yaqian Zhou, Xipeng Qiu, Qi Zhang, Xuanjing Huang


Citations: 1

Abstract

Supervised learning has been applied to query expansion: a model is trained to predict the “goodness” or “utility” of a candidate expansion term for the retrieval system. Many features measure the relatedness between an expansion word and the query, and they can be incorporated into supervised learning to select expansion terms. The training data set is generated automatically by a tricky method, which can be affected by many factors. A severe problem, not discussed in previous work, is that the distribution of the features is query-dependent. Because the feature distributions differ across queries, it is questionable to merge these training instances and use the whole data set to train a single model. In this paper, we first investigate the statistical distribution of the automatically generated training data and show the problems in the training data set. Based on our analysis, we propose using the bagging method to ensemble several regression models in order to obtain a better supervised model for making predictions about expansion terms. We conducted experiments on the TREC benchmark test collections. Our analysis of the training data reveals some interesting phenomena about query expansion techniques. The experimental results also show that the bagging approach can achieve state-of-the-art retrieval performance on the standard TREC data set.
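As a rough illustration of what bagging an ensemble of regression models means, the sketch below fits several simple regressors on bootstrap resamples of the training instances and averages their predictions. It is not the authors' implementation: the single-feature least-squares regressor, the function names (`fit_line`, `bagged_fit`, `bagged_predict`), the number of models, and the reading of `x` as one relatedness feature and `y` as a term-utility label are all assumptions made for the example.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single feature (assumed base learner)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def bagged_fit(xs, ys, n_models=25, seed=0):
    """Train n_models regressors, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Sample n indices with replacement (one bootstrap replicate).
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Average the predictions of all regressors in the ensemble."""
    return mean(a * x + b for a, b in models)


# Hypothetical toy data: x is a relatedness feature, y a utility label.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
models = bagged_fit(xs, ys)
score = bagged_predict(models, 3)
```

In a query expansion setting, candidate terms would be ranked by the averaged score and the top-ranked ones added to the query. Averaging over resamples is intended to make the final predictor less sensitive to any one subset of the training instances.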

