Intelligent Distributed Web Crawler Based on Attention Mechanism

Yi Wu, Yan Song, Hongshan Yang
DOI: https://doi.org/10.1145/3438872.3439085
Published in: Proceedings of the 2020 2nd International Conference on Robotics, Intelligent Control and Artificial Intelligence
Publication date: 2020-10-17
Citations: 1

Abstract

With the rapid development of the Internet, web content has become the central platform through which people publish and retrieve information. Web crawlers can now quickly and accurately find the information users need among massive online information resources, and many types of web crawlers for data retrieval have been described in the literature. However, most existing web crawlers have significant limitations. For example, they focus on an effective overall architecture rather than on the complexity of the actual data. Moreover, advertising links in news pages and promotional content on public platforms have become ubiquitous noise. Existing crawler collection strategies do not adequately identify advertising information, and the degree of automation in advertisement detection is low, which makes it difficult to build a complete, deployable, large-scale distributed data crawling system. Research on improving distributed web crawlers so that they intelligently distinguish advertisements is therefore of practical significance. The distributed intelligent web crawler system designed and implemented in this paper addresses the low efficiency of manual crawling and poor data quality. The experimental results show that the crawler system can effectively identify and eliminate advertising information and significantly improve the quality of the automatically extracted data in the distributed crawler system.