Learning Web Content Extraction with DOM Features

2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP). Pub Date: 2018-09-01. DOI: 10.1109/ICCP.2018.8516632

Nichita Utiu, Vlad-Sebastian Ionescu

Citations: 5

Abstract

Content extraction aims to separate the main content of web pages from the bulk of template and decorative components. We present a method for doing this that achieves competitive performance on the Cleaneval dataset and sets a new state of the art on the Dragnet dataset, with an F1 score of 0.96. We accomplish this by modeling the task as a classification problem over HTML tags, using features based on information from the DOM tree. Not only do we obtain a performance increase over current methods, but we do so with minimal feature engineering and without the extensive preprocessing steps that other methods require.
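To make the idea of "classification over HTML tags using DOM features" concrete, here is a minimal sketch of the feature-extraction half only, using Python's standard `html.parser`. The feature set (`depth`, `parent_tag`, `n_children`, `text_len`, `link_density`) and the names `DomFeatureExtractor` and `extract_features` are illustrative assumptions, not the features or code used in the paper; the abstract does not list them. Each resulting row could then be labeled content or boilerplate and passed to any off-the-shelf classifier.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are counted as children only.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureExtractor(HTMLParser):
    """Builds one feature dict per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.rows = []   # feature dicts of elements that have been closed

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1]["n_children"] += 1
        if tag in VOID_TAGS:
            return
        self.stack.append({
            "tag": tag,
            "depth": len(self.stack),
            "parent_tag": self.stack[-1]["tag"] if self.stack else None,
            "n_children": 0,
            "text_len": 0,
            "link_text_len": 0,
        })

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element,
        # which tolerates missing end tags in sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                while len(self.stack) > i:
                    self.rows.append(self.stack.pop())
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        inside_link = any(node["tag"] == "a" for node in self.stack)
        # Text counts toward every open ancestor, not just the innermost one.
        for node in self.stack:
            node["text_len"] += n
            if inside_link:
                node["link_text_len"] += n


def extract_features(html):
    """Return a list of per-element feature dicts for an HTML string."""
    parser = DomFeatureExtractor()
    parser.feed(html)
    parser.close()
    rows = parser.rows + parser.stack[::-1]  # include unclosed elements
    for row in rows:
        total = row["text_len"]
        row["link_density"] = row["link_text_len"] / total if total else 0.0
    return rows


if __name__ == "__main__":
    page = ('<div><p>Hello world</p>'
            '<ul><li><a href="#">Home</a></li></ul></div>')
    for row in extract_features(page):
        print(row["tag"], row["depth"], row["text_len"], row["link_density"])
```

In this toy page the paragraph has a link density of 0 while the navigation list item has a link density of 1, which is the kind of signal a tag-level classifier could learn from.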
