History-enhanced focused website segment crawler

Tanaphol Suebchua, Bundit Manaskasemsak, A. Rungsawang, H. Yamana
Published in: 2018 International Conference on Information Networking (ICOIN), 2018-04-19
DOI: 10.1109/ICOIN.2018.8343090 (https://doi.org/10.1109/ICOIN.2018.8343090)
Citation count: 3

Abstract

The primary challenge in focused crawling research is how to use computing resources such as bandwidth, disk space, and time efficiently to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that targets website segments, i.e., groups of relevant web pages located in the same directory path, and that has achieved high efficiency. One limitation of that approach is that it may repeatedly visit a website that serves no relevant website segments when the segments on that website share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a “history-enhanced focused website segment crawler” to solve this problem. The underlying idea is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from its website. To implement this idea, we propose a new prediction feature, called the “history feature”, that is extracted from recent crawling results, i.e., the relevant and irrelevant web pages gathered from the target website. Our experiment shows that the proposed feature can improve the crawling efficiency of our focused crawler by up to approximately 5%.
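
The history idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not define how the history feature is computed, so everything here is an assumption: the class `SiteHistory`, the function `adjusted_priority`, the window size, the neutral value 0.5 for unseen sites, and the linear penalty are all hypothetical. The abstract describes the feature as an input to the crawler's learned predictor; the multiplicative penalty below merely stands in for that model to show the intended effect on priority scores.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class SiteHistory:
    """Tracks recent relevance judgments per website (hypothetical helper)."""

    def __init__(self, window=20):
        # window: number of recent downloads remembered per site (assumed value)
        self._results = defaultdict(lambda: deque(maxlen=window))

    def record(self, url, relevant):
        """Store whether a downloaded page was judged relevant."""
        self._results[urlsplit(url).netloc].append(bool(relevant))

    def history_feature(self, url):
        """Fraction of relevant pages among recent downloads from the URL's site.

        Returns 0.5 (an assumed neutral value) when the site has no history.
        """
        results = self._results.get(urlsplit(url).netloc)
        if not results:
            return 0.5
        return sum(results) / len(results)

    def irrelevant_streak(self, url):
        """Count consecutive irrelevant pages at the end of the site's history."""
        streak = 0
        for relevant in reversed(self._results.get(urlsplit(url).netloc, ())):
            if relevant:
                break
            streak += 1
        return streak


def adjusted_priority(base_score, history, url, penalty=0.1):
    """Scale a segment's base score down by an assumed linear penalty
    for each page in the site's current irrelevant streak."""
    factor = max(0.0, 1.0 - penalty * history.irrelevant_streak(url))
    return base_score * factor
```

In use, a crawler would call `record` after classifying each downloaded page and call `adjusted_priority` when scoring an unvisited segment, so that segments on a site with a long run of irrelevant pages sink in the queue while segments on unseen sites keep their base score.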