Bayes topic prediction model for focused crawling of vertical search engine

Weihong Zhang, Yong Chen
2014 IEEE Computers, Communications and IT Applications Conference, published 2014-10-01
DOI: 10.1109/ComComAp.2014.7017213 (https://doi.org/10.1109/ComComAp.2014.7017213)
Citations: 3

Abstract

Vertical search is an important topic in search engine design because it offers more abundant and more precise results in a specific domain than large-scale search engines such as Google and Baidu. Prior to this paper, most vertical search engines were built from manually selected and edited materials, which cost considerable time and money. In this paper, we propose a new information resource discovery model and build a crawler for a vertical search engine that selectively fetches webpages relevant to a predefined topic. The model has three aspects. First, webpages are transformed into term vectors. We propose TF-TUF, short for Term Frequency-Topic Unbalanced Factor, as the weighting scheme in the vector space model. This scheme puts more weight on terms whose frequencies differ greatly among topics, which we believe contribute more to topic prediction. Second, we use a Bayes method to predict the topics of webpages, with topic-labeled text used for training in advance. The specific method for using Bayes to predict the topic is described in the algorithm section. Third, we create a focused crawler that uses the topic prediction result. The prediction result is used not only to filter out irrelevant webpages but also to direct the crawler toward the areas most likely to be topic-relevant. The three aspects work together to discover topic-relevant materials on the web efficiently when building a vertical search engine. Our experiment shows that the average prediction accuracy of the proposed model can exceed 85%. As an application, we also used the proposed model to build "Search Engine for S&T" (http://nstr.com.cn/search), a vertical search engine for the science field.
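
The abstract gives neither the TF-TUF formula nor the details of the Bayes method, so the following is only a minimal Python sketch of how the three aspects could fit together, not the authors' implementation. It assumes a multinomial naive Bayes classifier with Laplace smoothing, and its `weight` function is a hypothetical stand-in for the topic unbalanced factor that simply grows when a term's smoothed probability differs more between topics. The names `TopicModel` and `focused_crawl`, the `threshold` parameter, the in-memory `pages` dictionary that replaces real HTTP fetching, and the toy training data are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real system would also strip HTML and stop words.
    return text.lower().split()


class TopicModel:
    """Multinomial naive Bayes with an assumed topic-imbalance term weight."""

    def __init__(self, labeled_docs):
        # labeled_docs: iterable of (topic, text) pairs used for training in advance.
        self.term_counts = defaultdict(Counter)  # topic -> term -> count
        self.doc_counts = Counter()              # topic -> number of documents
        for topic, text in labeled_docs:
            self.doc_counts[topic] += 1
            self.term_counts[topic].update(tokenize(text))
        self.topics = sorted(self.doc_counts)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        self.totals = {k: sum(self.term_counts[k].values()) for k in self.topics}

    def term_prob(self, term, topic):
        # Laplace-smoothed estimate of P(term | topic).
        return (self.term_counts[topic][term] + 1) / (self.totals[topic] + len(self.vocab))

    def weight(self, term):
        # Hypothetical imbalance factor (NOT the paper's TUF definition):
        # 1.0 for evenly spread terms, larger when topics disagree on the term.
        probs = [self.term_prob(term, k) for k in self.topics]
        return 1.0 + math.log(max(probs) / min(probs))

    def predict(self, text):
        # Returns a dict topic -> normalized posterior estimate.
        tf = Counter(t for t in tokenize(text) if t in self.vocab)
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for k in self.topics:
            score = math.log(self.doc_counts[k] / n_docs)  # log prior
            for term, count in tf.items():
                score += count * self.weight(term) * math.log(self.term_prob(term, k))
            scores[k] = score
        top = max(scores.values())  # subtract the max for numerical stability
        exp = {k: math.exp(s - top) for k, s in scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


def focused_crawl(model, pages, seed, target, threshold=0.5, limit=10):
    # pages: dict url -> (text, outgoing links), standing in for the web.
    frontier = [(-1.0, seed)]  # min-heap on negated priority
    seen = {seed}
    kept = []
    while frontier and len(kept) < limit:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue
        text, links = pages[url]
        relevance = model.predict(text)[target]
        if relevance < threshold:
            continue  # filter the page and do not follow its links
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Links inherit the relevance of the page that cites them.
                heapq.heappush(frontier, (-relevance, link))
    return kept


# Toy data, invented for this example only.
train = [
    ("science", "physics experiment quantum energy particle laboratory"),
    ("science", "chemistry molecule experiment reaction laboratory energy"),
    ("sports", "football match goal team league player"),
    ("sports", "tennis match player tournament team score"),
]
model = TopicModel(train)
pages = {
    "a": ("quantum particle experiment in the laboratory", ["b", "c"]),
    "b": ("football team wins league match", ["d"]),
    "c": ("molecule reaction energy experiment", []),
    "d": ("chemistry laboratory", []),
}
kept = focused_crawl(model, pages, "a", "science")
```

In this sketch the prediction result plays both roles named in the abstract: pages scoring below `threshold` for the target topic are discarded, and links found on relevant pages are queued with a priority equal to the relevance of the page that contains them. Not following links from discarded pages is a simplification chosen here; a crawler could instead queue them at a low priority.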