Accelerating Structured Web Crawling without Losing Data

International Conference on Information Integration and Web-based Applications & Services. Pub Date: 2013-12-02. DOI: 10.1145/2539150.2539203

B. R. El-Gamil, W. Winiwarter


Citations: 0

Abstract


The trade-off between the size of the retrieved data and the crawling time is a well-known dilemma in the structured Web crawling community. The real challenge within this dilemma is to optimize the settings of a given wrapper so that it obtains the maximum available data in the shortest possible time. In this paper, we try to tune these settings by introducing a threaded algorithm that guarantees access to all available detail pages within the crawling scope. Using this algorithm, we then try to reduce the time consumed by the crawler through simple adjustments of the sleeping time after each detail page visit.
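The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the detail-page URLs are already known, that a caller-supplied `fetch` function retrieves one page, and that failed visits are simply re-queued so that no page is dropped. The names `crawl_detail_pages`, `num_threads`, `sleep_seconds`, and `max_attempts` are hypothetical.

```python
import queue
import threading
import time
from typing import Callable, Dict, Iterable


def crawl_detail_pages(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    num_threads: int = 4,
    sleep_seconds: float = 0.0,
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Visit every URL with a pool of threads, sleeping after each visit.

    Illustrative only: a failed visit is re-queued (up to max_attempts)
    so that a transient error does not silently lose a detail page.
    """
    pending = queue.Queue()
    for url in urls:
        pending.put((url, 1))  # (url, attempt number)

    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url, attempt = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker to do
            try:
                page = fetch(url)
                with lock:
                    results[url] = page
            except Exception:
                # Re-queue the page; this same worker keeps looping,
                # so the retry is always picked up by some thread.
                if attempt < max_attempts:
                    pending.put((url, attempt + 1))
            finally:
                # The tunable pause after each detail page visit.
                time.sleep(sleep_seconds)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch `sleep_seconds` plays the role of the sleeping time mentioned in the abstract: lowering it shortens the crawl, and comparing the number of keys in the returned dictionary against the number of input URLs shows whether any detail page was lost at that setting.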
