Recognition of common areas in a Web page using visual information: a possible application in a page classification

2002 IEEE International Conference on Data Mining, 2002. Proceedings. Pub Date: 2002-12-09. DOI: 10.1109/ICDM.2002.1183910

M. Kovačević, Michelangelo Diligenti, M. Gori, V. Milutinovic

Citations: 132

Abstract

Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on this flat representation. We propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this visual information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. We show in initial experiments that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
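The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. The abstract does not state the actual heuristics or thresholds, so everything below is an assumption made for illustration: the `RenderedObject` type, the `classify_area` function, and the pixel and fraction cut-offs are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class RenderedObject:
    """An HTML object with its browser screen coordinates (hypothetical type)."""
    tag: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


# Hypothetical thresholds, chosen only for illustration (not from the paper).
HEADER_MAX_Y = 200     # objects ending above this y count as header
FOOTER_HEIGHT = 150    # objects starting in the last 150 px count as footer
SIDE_FRACTION = 0.30   # left/right 30% of the page width count as menus


def classify_area(obj: RenderedObject, page_width: float, page_height: float) -> str:
    """Assign an object to a page area using only its position."""
    top, bottom = obj.y, obj.y + obj.height
    left, right = obj.x, obj.x + obj.width

    if bottom <= HEADER_MAX_Y:
        return "header"
    if top >= page_height - FOOTER_HEIGHT:
        return "footer"
    if right <= page_width * SIDE_FRACTION:
        return "left_menu"
    if left >= page_width * (1 - SIDE_FRACTION):
        return "right_menu"
    return "center"


if __name__ == "__main__":
    banner = RenderedObject("table", x=0, y=0, width=1000, height=100)
    print(classify_area(banner, page_width=1000, page_height=2000))
```

An area label obtained this way could then be attached to each word as an extra feature for a classifier, which is one possible reading of "taking into account the proposed representation".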


