网络爬虫的交叉监督合成

2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) Pub Date : 2016-05-14 DOI:10.1145/2884781.2884842

Adi Omari, Sharon Shoham, Eran Yahav

{"title":"网络爬虫的交叉监督合成","authors":"Adi Omari, Sharon Shoham, Eran Yahav","doi":"10.1145/2884781.2884842","DOIUrl":null,"url":null,"abstract":"A web-crawler is a program that automatically and systematically tracks the links of a website and extracts information from its pages. Due to the different formats of websites, the crawling scheme for different sites can differ dramatically. Manually customizing a crawler for each specific site is time consuming and error-prone. Furthermore, because sites periodically change their format and presentation, crawling schemes have to be manually updated and adjusted. In this paper, we present a technique for automatic synthesis of web-crawlers from examples. The main idea is to use hand-crafted (possibly partial) crawlers for some websites as the basis for crawling other sites that contain the same kind of information. Technically, we use the data on one site to identify data on another site. We then use the identified data to learn the website structure and synthesize an appropriate extraction scheme. We iterate this process, as synthesized extraction schemes result in additional data to be used for re-learning the website structure. We implemented our approach and automatically synthesized 30 crawlers for websites from nine different categories: books, TVs, conferences, universities, cameras, phones, movies, songs, and hotels.","PeriodicalId":6485,"journal":{"name":"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)","volume":"18 1","pages":"368-379"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Cross-Supervised Synthesis of Web-Crawlers\",\"authors\":\"Adi Omari, Sharon Shoham, Eran Yahav\",\"doi\":\"10.1145/2884781.2884842\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A web-crawler is a program that automatically and systematically tracks the links of a website and extracts information from its pages. Due to the different formats of websites, the crawling scheme for different sites can differ dramatically. Manually customizing a crawler for each specific site is time consuming and error-prone. Furthermore, because sites periodically change their format and presentation, crawling schemes have to be manually updated and adjusted. In this paper, we present a technique for automatic synthesis of web-crawlers from examples. The main idea is to use hand-crafted (possibly partial) crawlers for some websites as the basis for crawling other sites that contain the same kind of information. Technically, we use the data on one site to identify data on another site. We then use the identified data to learn the website structure and synthesize an appropriate extraction scheme. We iterate this process, as synthesized extraction schemes result in additional data to be used for re-learning the website structure. We implemented our approach and automatically synthesized 30 crawlers for websites from nine different categories: books, TVs, conferences, universities, cameras, phones, movies, songs, and hotels.\",\"PeriodicalId\":6485,\"journal\":{\"name\":\"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)\",\"volume\":\"18 1\",\"pages\":\"368-379\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2884781.2884842\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2884781.2884842","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

网络爬虫是一种自动系统地跟踪网站链接并从其页面中提取信息的程序。由于网站的格式不同，不同网站的抓取方案可能会有很大的不同。手动为每个特定站点定制爬虫既耗时又容易出错。此外，由于网站会定期更改其格式和表示，因此必须手动更新和调整爬行方案。本文从实例出发，提出了一种自动合成网络爬虫的技术。主要思想是对一些网站使用手工制作的(可能是部分的)抓取工具，作为抓取包含相同类型信息的其他网站的基础。从技术上讲，我们使用一个站点上的数据来识别另一个站点上的数据。然后利用识别出的数据来了解网站结构，并综合出合适的提取方案。我们重复这个过程，因为综合的提取方案会产生额外的数据，用于重新学习网站结构。我们实现了我们的方法，并自动合成了30个爬虫，用于9个不同类别的网站:书籍、电视、会议、大学、相机、电话、电影、歌曲和酒店。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Cross-Supervised Synthesis of Web-Crawlers

A web-crawler is a program that automatically and systematically tracks the links of a website and extracts information from its pages. Due to the different formats of websites, the crawling scheme for different sites can differ dramatically. Manually customizing a crawler for each specific site is time consuming and error-prone. Furthermore, because sites periodically change their format and presentation, crawling schemes have to be manually updated and adjusted. In this paper, we present a technique for automatic synthesis of web-crawlers from examples. The main idea is to use hand-crafted (possibly partial) crawlers for some websites as the basis for crawling other sites that contain the same kind of information. Technically, we use the data on one site to identify data on another site. We then use the identified data to learn the website structure and synthesize an appropriate extraction scheme. We iterate this process, as synthesized extraction schemes result in additional data to be used for re-learning the website structure. We implemented our approach and automatically synthesized 30 crawlers for websites from nine different categories: books, TVs, conferences, universities, cameras, phones, movies, songs, and hotels.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)

自引率

0.00%

发文量