{"title":"Bayes topic prediction model for focused crawling of vertical search engine","authors":"Weihong Zhang, Yong Chen","doi":"10.1109/ComComAp.2014.7017213","DOIUrl":null,"url":null,"abstract":"Vertical search is an important topic in the design of search engines as it offers more abundant and more precise results on specific domain compared with large-scale search engines, like Google and Baidu. Prior to this paper, most vertical search engines were built using manually selected and edited materials, which was time and money consuming. In this paper, we propose a new information resource discovery model and build a crawler in the vertical search engine, which can selectively fetch webpages relevant to a pre-defined topic. The model includes three aspects. First, webpages are transformed into term vectors. TF-TUF , short for Term Frequency-Topic Unbalanced Factor , is proposed as the weighting schema in vector space model. In the schema,we put more weight on terms whose frequencies differ a lot among topics, which will contribute more in the topic prediction we believe. Second, we use Bayes method to predict the topics of the webpages, where topic labeled text is used for training in advance. The specific method about using Bayes to predict the topic is illustrated in the algorithm section. Third, we create a focused crawler using the topic prediction result. The prediction result is used not only to filter the irrelevant webpages but also to direct the crawler to the areas, which are most possible to be topic relevant. The whole three aspects work together to reach the goal of discovering the topic relevant materials on the web efficiently, in building a vertical search engine. Our experiment shows that the average prediction accuracy of our proposed model can reach more than 85%. For application, we also used the proposed model to build \"Search Engine for S&T\" (http://nstr.com.cn/search), a vertical search engine in science field.","PeriodicalId":422906,"journal":{"name":"2014 IEEE Computers, Communications and IT Applications Conference","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE Computers, Communications and IT Applications Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ComComAp.2014.7017213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3
Abstract
Vertical search is an important topic in the design of search engines as it offers more abundant and more precise results on specific domain compared with large-scale search engines, like Google and Baidu. Prior to this paper, most vertical search engines were built using manually selected and edited materials, which was time and money consuming. In this paper, we propose a new information resource discovery model and build a crawler in the vertical search engine, which can selectively fetch webpages relevant to a pre-defined topic. The model includes three aspects. First, webpages are transformed into term vectors. TF-TUF , short for Term Frequency-Topic Unbalanced Factor , is proposed as the weighting schema in vector space model. In the schema,we put more weight on terms whose frequencies differ a lot among topics, which will contribute more in the topic prediction we believe. Second, we use Bayes method to predict the topics of the webpages, where topic labeled text is used for training in advance. The specific method about using Bayes to predict the topic is illustrated in the algorithm section. Third, we create a focused crawler using the topic prediction result. The prediction result is used not only to filter the irrelevant webpages but also to direct the crawler to the areas, which are most possible to be topic relevant. The whole three aspects work together to reach the goal of discovering the topic relevant materials on the web efficiently, in building a vertical search engine. Our experiment shows that the average prediction accuracy of our proposed model can reach more than 85%. For application, we also used the proposed model to build "Search Engine for S&T" (http://nstr.com.cn/search), a vertical search engine in science field.