Bayes topic prediction model for focused crawling of vertical search engine

2014 IEEE Computers, Communications and IT Applications Conference. Pub Date: 2014-10-01. DOI: 10.1109/ComComAp.2014.7017213

Weihong Zhang, Yong Chen


Citations: 3

Abstract

Vertical search is an important topic in search engine design because, within a specific domain, it offers more abundant and more precise results than large-scale general search engines such as Google and Baidu. Until now, most vertical search engines have been built from manually selected and edited materials, which costs both time and money. In this paper, we propose a new information resource discovery model and build a crawler for a vertical search engine that selectively fetches webpages relevant to a predefined topic. The model has three parts. First, webpages are transformed into term vectors. We propose TF-TUF, short for Term Frequency-Topic Unbalanced Factor, as the weighting scheme in the vector space model. The scheme puts more weight on terms whose frequencies differ greatly among topics, which we believe contribute more to topic prediction. Second, we use a Bayes method to predict the topics of webpages, training in advance on topic-labeled text. The details of the Bayes prediction method are given in the algorithm section. Third, we create a focused crawler driven by the topic prediction result, which is used not only to filter out irrelevant webpages but also to direct the crawler toward the areas most likely to be topic-relevant. Together, the three parts serve the goal of efficiently discovering topic-relevant materials on the web when building a vertical search engine. Our experiments show that the average prediction accuracy of the proposed model can reach more than 85%. As an application, we also used the proposed model to build "Search Engine for S&T" (http://nstr.com.cn/search), a vertical search engine for the science field.
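The second and third parts of the model can be illustrated with a minimal sketch. The abstract does not give the TF-TUF formula or the exact Bayes formulation, so the code below is an assumption throughout: it uses plain term counts in a multinomial naive Bayes classifier with Laplace smoothing, and a best-first crawl frontier ordered by the predicted relevance of the parent page. The names `NaiveBayesTopicModel`, `Frontier`, `focused_crawl`, the `fetch` callable, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
import heapq
import math
from collections import Counter, defaultdict


class NaiveBayesTopicModel:
    """Multinomial naive Bayes over raw term counts (not TF-TUF weights)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # topic -> term -> count
        self.doc_counts = Counter()              # topic -> training docs
        self.vocab = set()

    def train(self, labeled_docs):
        """labeled_docs: iterable of (topic, text) pairs."""
        for topic, text in labeled_docs:
            terms = text.lower().split()
            self.term_counts[topic].update(terms)
            self.doc_counts[topic] += 1
            self.vocab.update(terms)

    def log_scores(self, text):
        """Return log P(topic) + sum of log P(term | topic) per topic."""
        terms = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for topic in self.doc_counts:
            total_terms = sum(self.term_counts[topic].values())
            score = math.log(self.doc_counts[topic] / total_docs)
            for term in terms:
                count = self.term_counts[topic][term]
                # Laplace smoothing keeps unseen terms from zeroing the score.
                score += math.log((count + 1) / (total_terms + len(self.vocab)))
            scores[topic] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

    def posterior(self, text, topic):
        """Normalized P(topic | text), computed stably from log scores."""
        scores = self.log_scores(text)
        peak = max(scores.values())
        norm = sum(math.exp(s - peak) for s in scores.values())
        return math.exp(scores[topic] - peak) / norm


class Frontier:
    """Best-first URL queue: highest estimated relevance is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, relevance):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-relevance, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


def focused_crawl(model, target, seeds, fetch, threshold=0.5, limit=100):
    """Crawl from seeds, keeping pages predicted relevant to `target`.

    `fetch(url)` is a caller-supplied function returning (text, links).
    Links are only followed from pages that pass the relevance threshold.
    """
    frontier = Frontier()
    for url in seeds:
        frontier.push(url, 1.0)
    kept, visited = [], 0
    while len(frontier) and visited < limit:
        url = frontier.pop()
        visited += 1
        text, links = fetch(url)
        relevance = model.posterior(text, target)
        if relevance < threshold:
            continue  # filter out the page and do not expand its links
        kept.append(url)
        for link in links:
            # Outgoing links inherit the parent's relevance as their priority.
            frontier.push(link, relevance)
    return kept
```

In this sketch the prediction plays both roles described above: pages below the threshold are dropped, and links found on relevant pages are queued ahead of links found on marginal ones. A TF-TUF weighting, once defined, would replace the raw counts used in `log_scores`.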


