Intelligent Distributed Web Crawler Based on Attention Mechanism

Proceedings of the 2020 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence. Pub Date: 2020-10-17. DOI: 10.1145/3438872.3439085

Yi Wu, Yan Song, Hongshan Yang


Citations: 1

Abstract

With the rapid development of the Internet, web page content has become the central medium through which people publish and retrieve information. Web crawlers can quickly and accurately find the information users need among massive network information resources, and many different types of crawlers have been developed for data retrieval. However, most existing web crawlers have significant limitations. For example, they focus on an effective overall architecture rather than on the complexity of the actual data. Moreover, advertising links in news pages and promotional content on public platforms have become ubiquitous noise. Existing crawler collection strategies do not adequately identify advertising information, and the degree of automation in advertisement detection is low, which makes it difficult to build a complete, deployable, large-scale distributed data crawling system. Research on and improvement of distributed web crawlers that intelligently distinguish advertisements is therefore of practical significance. The distributed intelligent web crawler system designed and implemented in this paper addresses the low efficiency of manual crawling and poor data quality. Experimental results show that the crawler system can effectively identify and eliminate advertising information and significantly improve the quality of the data automatically extracted by the distributed crawler system.
