{"title":"Intelligent Distributed Web Crawler Based on Attention Mechanism","authors":"Yi Wu, Yan Song, Hongshan Yang","doi":"10.1145/3438872.3439085","DOIUrl":null,"url":null,"abstract":"With the rapid development of the Internet, webpages' content has become the central platform for people to publish and retrieve information. Recently, web crawlers could quickly and accurately find the information users need from the massive network information resources. There have been many different types of web crawlers in the literature, developed for data retrieval. However, most of the existing web crawlers have significant limitations. For example, they focus on the effective overall architecture instead of paying attention to the actual data's complexity. Moreover, the advertising links in the news and the public platform's promotional content have become ubiquitous noise. The existing web crawler collection strategy lacks sufficient identification of advertising information. The degree of automation to detect advertisements is low, so it isn't easy to form a complete and deployable large-scale distributed data crawling system. Therefore, the research and improvement of distributed web crawlers that intelligently distinguish advertisements is a work of practical significance. The distributed intelligent web crawler system designed and implemented in this paper solves low manual crawler efficiency and poor data quality. The crawler system can effectively identify and eliminate advertising information and significantly improve the automatically extracted data in the distributed crawler system from the experimental results.","PeriodicalId":199307,"journal":{"name":"Proceedings of the 2020 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3438872.3439085","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
With the rapid development of the Internet, webpages' content has become the central platform for people to publish and retrieve information. Recently, web crawlers could quickly and accurately find the information users need from the massive network information resources. There have been many different types of web crawlers in the literature, developed for data retrieval. However, most of the existing web crawlers have significant limitations. For example, they focus on the effective overall architecture instead of paying attention to the actual data's complexity. Moreover, the advertising links in the news and the public platform's promotional content have become ubiquitous noise. The existing web crawler collection strategy lacks sufficient identification of advertising information. The degree of automation to detect advertisements is low, so it isn't easy to form a complete and deployable large-scale distributed data crawling system. Therefore, the research and improvement of distributed web crawlers that intelligently distinguish advertisements is a work of practical significance. The distributed intelligent web crawler system designed and implemented in this paper solves low manual crawler efficiency and poor data quality. The crawler system can effectively identify and eliminate advertising information and significantly improve the automatically extracted data in the distributed crawler system from the experimental results.