Page-Level Main Content Extraction From Heterogeneous Webpages

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date: 2021-06-28 DOI: 10.1145/3451168

Julián Alarte, Josep Silva

Citations: 3

Abstract

The main content of a webpage is often surrounded by boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. Detecting and extracting the main content is also useful in areas such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a page-level technique based on the Document Object Model (DOM), so it needs to load only a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique on a suite of real heterogeneous benchmarks, where it produced very good results compared with other well-known content extraction techniques.
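
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of what a DOM-based, page-level extractor can look like, not the authors' method. It assumes a generic heuristic: starting at the document root, keep descending into the child that holds at least a given share of the non-link text, and stop when no single child dominates. All names (`Node`, `TreeBuilder`, `main_content`, `share`, `collect`) and the sample page are hypothetical. Because the result is a DOM subtree rather than a string, non-text content such as images is kept along with the text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP_TEXT = {"script", "style"}  # their contents are not visible text


class Node:
    """A very small DOM node: tag, attributes, children, and direct text."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from one HTML page (the only page that is loaded)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP_TEXT:
            self.cur.text += data


def non_link_chars(node, in_link=False):
    """Count non-whitespace characters in the subtree that are outside <a> elements."""
    in_link = in_link or node.tag == "a"
    own = 0 if in_link else len("".join(node.text.split()))
    return own + sum(non_link_chars(c, in_link) for c in node.children)


def main_content(html, share=0.6):
    """Return the smallest subtree reached by following the dominant child."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = builder.root
    while True:
        total = non_link_chars(node)
        best = max(node.children, key=non_link_chars, default=None)
        if best is None or total == 0 or non_link_chars(best) < share * total:
            return node
        node = best


def collect(node, tag):
    """Return every element with the given tag inside the subtree."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found += collect(child, tag)
    return found


PAGE = """
<html><body>
  <div id="menu"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
  <div id="story">
    <h1>Storm closes mountain pass</h1>
    <p>Heavy snow closed the pass on Monday, and crews expect to reopen it within two days.</p>
    <img src="pass.jpg">
    <p>Drivers are advised to take the valley road, which adds about forty minutes.</p>
  </div>
  <div id="footer">Copyright 2021 Example News</div>
</body></html>
"""

result = main_content(PAGE)
images = [img.attrs["src"] for img in collect(result, "img")]
print(result.attrs.get("id"), images)  # prints: story ['pass.jpg']
```

On the sample page, the menu contributes no non-link text and the footer only a little, so the walk stops at the `story` block and the image inside it is retained. The `share` threshold is an arbitrary choice for this sketch; a real extractor would need a more robust scoring function.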

Source journal

ACM Transactions on Knowledge Discovery from Data (TKDD)

Self-citation rate

0.00%

Articles published