Extracting Records from the Web Using a Signal Processing Approach

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Pub Date: 2017-11-06. DOI: 10.1145/3132847.3132875

R. P. Velloso, C. Dorneles

Citations: 7

Abstract

Extracting records from web pages enables a number of important applications and has immense value due to the amount and diversity of information that can be extracted. This problem, although widely studied, remains open because it is far from trivial. Given the scale of the data, a feasible approach must be automatic and efficient as well as effective. We present a novel approach, fully automatic and computationally efficient, that uses signal processing techniques to detect regularities and patterns in the structure of web pages. Our approach segments the web page, detects the data regions within it, identifies the record boundaries, and aligns the records. Results show a high F-score and linearithmic, O(n log n), time complexity behaviour.
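To make the idea of treating page structure as a signal more concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the abstract does not say how the signal is encoded or which transforms are applied, so everything below is an assumption made for illustration. The hypothetical helpers `tag_signal` and `estimate_period` encode the sequence of opening tags as integers and then look for the shift at which that sequence best matches itself, a simple match-based stand-in for autocorrelation. This brute-force scan is quadratic in the worst case, so it does not reflect the linearithmic behaviour reported in the paper.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_signal(html):
    """Encode a page as a list of integers, one code per distinct tag name.

    Assumption for illustration: codes are assigned in order of first
    appearance; the paper's actual encoding is not given in the abstract.
    """
    parser = TagSequence()
    parser.feed(html)
    codes = {}
    return [codes.setdefault(tag, len(codes)) for tag in parser.tags]


def estimate_period(signal, max_lag=None):
    """Return (lag, score) for the shift at which the signal best matches itself.

    The score is the fraction of positions whose code equals the code
    `lag` positions later. Ties keep the smallest lag.
    """
    n = len(signal)
    max_lag = max_lag or n // 2
    best_lag, best_score = 0, 0.0
    for lag in range(1, max_lag + 1):
        matches = sum(1 for i in range(n - lag) if signal[i] == signal[i + lag])
        score = matches / (n - lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Toy page: five records, each made of the tags li, a, span.
page = "<ul>" + "<li><a>x</a><span>y</span></li>" * 5 + "</ul>"
lag, score = estimate_period(tag_signal(page))
print(lag, round(score, 3))
```

On this toy page the estimated period is 3, the number of tags per record, which hints at where record boundaries could fall. A real page would first need segmentation and data region detection, the earlier steps named in the abstract, before such a periodicity estimate becomes meaningful.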
