Web Crawler: Design And Implementation For Extracting Article-Like Contents

Q3 Physics and Astronomy

Cybernetics and Physics. Pub Date: 2020-11-30. DOI: 10.35470/2226-4116-2020-9-3-144-151

Ngo Le Huy Hien, Thai Quang Tien, Hieu Nguyen Van


Citations: 6

Abstract

The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the web in response to users' requests, search engines are built to access web pages. Because search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data on the web is unmanaged, which makes it impossible for current search engine mechanisms to access the entire network at once. The web crawler is therefore a critical part of a search engine: it navigates the web and downloads the full text of web pages. Web crawlers may also be applied to detect missing links and to perform community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed and implemented to extract article-like content from 495 websites. It uses a machine learning approach with visual cues, trivial HTML features, and text-based features to filter out clutter. The outcomes are promising for extracting article-like content from websites, contributing to the development of search engine systems and to future research geared towards higher-performance systems.
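To make the idea of text-based features concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not list the actual features or the classifier, so everything here is an assumption: the names `BlockFeatureParser`, `extract_blocks`, and `is_article_like` are hypothetical, only two text features are computed (word count and link density), and a fixed threshold stands in for the trained machine learning model. Visual cues are not modeled at all.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "article", "section", "nav", "header", "footer"}
# Tags whose text content is never visible article text.
SKIP_TAGS = {"script", "style"}


class BlockFeatureParser(HTMLParser):
    """Split a page into text blocks and compute simple per-block features."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks, one feature dict each
        self._words = []      # words of the block currently being built
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of <a> elements
        self._skip = 0        # nesting depth of <script>/<style>

    def _flush(self):
        """Close the current block and store its features, if it has text."""
        if self._words:
            count = len(self._words)
            self.blocks.append({
                "text": " ".join(self._words),
                "word_count": count,
                "link_density": self._link_words / count,
            })
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in SKIP_TAGS and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()  # store any trailing text after the last block tag


def extract_blocks(html):
    """Return a list of feature dicts, one per non-empty text block."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def is_article_like(block, min_words=10, max_link_density=0.3):
    """Threshold rule standing in for a trained classifier (assumed values)."""
    return (block["word_count"] >= min_words
            and block["link_density"] <= max_link_density)
```

Calling `extract_blocks(page_html)` and keeping only the blocks for which `is_article_like` returns `True` drops short, link-heavy blocks such as navigation bars and footers. In a learned setting, the same feature dicts would be fed to a classifier trained on labeled blocks instead of the hand-set thresholds.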


Source journal: Cybernetics and Physics
Category: Chemical Engineering-Fluid Flow and Transfer Processes
CiteScore: 1.70
Self-citation rate: 0.00%
Review time: 10 weeks

Journal description: The scope of the journal includes:
- Nonlinear dynamics and control
- Complexity and self-organization
- Control of oscillations
- Control of chaos and bifurcations
- Control in thermodynamics
- Control of flows and turbulence
- Information Physics
- Cyber-physical systems
- Modeling and identification of physical systems
- Quantum information and control
- Analysis and control of complex networks
- Synchronization of systems and networks
- Control of mechanical and micromechanical systems
- Dynamics and control of plasma, beams, lasers, nanostructures
- Applications of cybernetic methods in chemistry, biology, other natural sciences

Papers in cybernetics with a physical flavor, as well as papers in physics with a cybernetic flavor, are welcome. Cybernetics is assumed to include, in addition to control, such areas as estimation, filtering, optimization, identification, information theory, pattern recognition, and other related areas.