{"title":"并行网络爬虫的动态URL分配方法","authors":"A. Guerriero, F. Ragni, Claudio Martines","doi":"10.1109/CIMSA.2010.5611764","DOIUrl":null,"url":null,"abstract":"A web crawler is a relatively simple automated program or script that methodically scans or “crawls” through Internet pages to retrieval information from data. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. There are many different uses for a web crawler. Their primary purpose is to collect data so that when Internet surfers enter a search term on their site, they can quickly provide the surfer with relevant web sites. In this work we propose the model of a low cost web crawler for distributed environments based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed and main rules that crawlers must follow to maintain load balancing and robustness of system when they are searching on the web simultaneously, are discussed. The proposed a dynamic URL assignment method, based on grid computing technology and dynamic clustering, results efficient increasing web crawler performance.","PeriodicalId":162890,"journal":{"name":"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"A dynamic URL assignment method for parallel web crawler\",\"authors\":\"A. Guerriero, F. Ragni, Claudio Martines\",\"doi\":\"10.1109/CIMSA.2010.5611764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A web crawler is a relatively simple automated program or script that methodically scans or “crawls” through Internet pages to retrieval information from data. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. There are many different uses for a web crawler. Their primary purpose is to collect data so that when Internet surfers enter a search term on their site, they can quickly provide the surfer with relevant web sites. In this work we propose the model of a low cost web crawler for distributed environments based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed and main rules that crawlers must follow to maintain load balancing and robustness of system when they are searching on the web simultaneously, are discussed. 
The proposed a dynamic URL assignment method, based on grid computing technology and dynamic clustering, results efficient increasing web crawler performance.\",\"PeriodicalId\":162890,\"journal\":{\"name\":\"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-10-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIMSA.2010.5611764\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIMSA.2010.5611764","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A web crawler is a relatively simple automated program or script that methodically scans, or "crawls", through Internet pages to retrieve information. Alternative names for a web crawler include web spider, web robot, bot, crawler, and automatic indexer. Web crawlers have many uses; their primary purpose is to collect data so that when Internet surfers enter a search term on a search site, it can quickly return relevant web sites. In this work we propose the model of a low-cost web crawler for distributed environments based on an efficient URL assignment algorithm. The function of every module of the crawler is analyzed, and the main rules that crawlers must follow to maintain load balancing and robustness of the system when they search the web simultaneously are discussed. The proposed dynamic URL assignment method, based on grid computing technology and dynamic clustering, efficiently increases web crawler performance.
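To make the core idea concrete, the sketch below shows one way a coordinator might assign URLs dynamically across parallel crawler nodes. This is a minimal illustration, not the paper's algorithm: the grid-computing middleware and dynamic-clustering step are replaced here by a simple load-aware rule (new hosts go to the currently least-loaded node), and all names (`Coordinator`, `assign`, `queue_depth`) are hypothetical.

```python
# Minimal sketch of dynamic URL assignment for a parallel crawler.
# Assumption: each host is pinned to one node (so per-host politeness
# delays can be enforced locally), and new hosts are routed to the
# least-loaded node rather than a fixed hash bucket.
from collections import defaultdict
from urllib.parse import urlparse


class Coordinator:
    """Assigns URLs to crawler nodes, keeping each host on one node
    while steering new hosts to the least-loaded node."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.host_owner = {}                   # host -> owning node id
        self.queue_depth = defaultdict(int)    # node id -> pending URLs

    def assign(self, url):
        host = urlparse(url).netloc
        if host not in self.host_owner:
            # Dynamic part: pick the node with the shortest queue,
            # instead of a static hash(host) % num_nodes mapping.
            self.host_owner[host] = min(
                self.node_ids, key=self.queue_depth.__getitem__)
        node = self.host_owner[host]
        self.queue_depth[node] += 1
        return node

    def done(self, url):
        # Called by a node when it finishes fetching a URL.
        node = self.host_owner[urlparse(url).netloc]
        self.queue_depth[node] -= 1


if __name__ == "__main__":
    c = Coordinator(["node-0", "node-1", "node-2"])
    for u in ["http://a.example/1", "http://b.example/1",
              "http://a.example/2", "http://c.example/1"]:
        print(u, "->", c.assign(u))
```

The contrast this sketch is meant to surface is between static assignment (a fixed URL-to-node mapping, which cannot react to skewed host sizes) and dynamic assignment, where the coordinator uses current load to keep the nodes balanced while they crawl simultaneously.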