Recognition of common areas in a Web page using visual information: a possible application in a page classification

M. Kovačević, Michelangelo Diligenti, M. Gori, V. Milutinovic
Published in: 2002 IEEE International Conference on Data Mining, 2002. Proceedings.
Publication date: 2002-12-09
DOI: 10.1109/ICDM.2002.1183910 (https://doi.org/10.1109/ICDM.2002.1183910)
Citations: 132

Abstract

Extracting and processing information from Web pages is an important task in many areas, such as search engine construction, information retrieval, and Web data mining. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on that flat representation. We propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this visual information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
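To make the idea of coordinate-based heuristics concrete, the following is a minimal Python sketch of how a rendered HTML object might be assigned to one of the common page areas from its screen rectangle. It is an illustration only, not the authors' heuristics: the `Box` type, the `classify_area` function, and all three threshold fractions are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Screen rectangle of one rendered HTML object, in pixels (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


# Assumed thresholds, expressed as fractions of the page size.
# These values are placeholders for illustration, not taken from the paper.
HEADER_FRACTION = 0.15
FOOTER_FRACTION = 0.15
SIDE_FRACTION = 0.25


def classify_area(box: Box, page_width: float, page_height: float) -> str:
    """Assign a box to header, footer, left menu, right menu, or center."""
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Objects lying entirely in the top band count as header.
    if bottom <= HEADER_FRACTION * page_height:
        return "header"
    # Objects lying entirely in the bottom band count as footer.
    if top >= (1 - FOOTER_FRACTION) * page_height:
        return "footer"
    # Objects lying entirely in a side band count as a menu.
    if right <= SIDE_FRACTION * page_width:
        return "left menu"
    if left >= (1 - SIDE_FRACTION) * page_width:
        return "right menu"
    # Everything else is treated as the center of the page.
    return "center"
```

For example, on an assumed 1000 x 2000 pixel page, `classify_area(Box(0, 400, 200, 800), 1000, 2000)` falls through the header and footer checks and is labeled a left menu because its right edge stays within the left quarter of the page. The resulting area labels could then be used as additional features alongside the words of a page, which is the kind of input a classifier needs in order to use layout as well as content.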