Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation

Jinbeom Kang, Joongmin Choi
Published in: 2007 International Symposium on Information Technology Convergence (ISITC 2007), 2007-11-23. DOI: 10.1109/ISITC.2007.40 (https://doi.org/10.1109/ISITC.2007.40)
Citations: 27

Abstract

As Web page structures grow more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (Recognizing Informative Page Blocks) algorithm, which detects the informative blocks in a Web page by exploiting a visual block segmentation scheme. RIPB uses a visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
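
The three stages named in the abstract (segment, cluster, apply heuristics) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the segmentation output, the similarity measure, or the heuristic rules. The `Block` shape, the exact-match clustering on a tag signature, and the `max_link_ratio` and `min_chars` thresholds are all assumptions made for illustration, and the segmentation step itself is taken as already done.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A visual block as a page segmenter might emit it (hypothetical shape)."""
    tags: tuple      # child tag names, used here as a structural signature
    text: str        # visible text of the block
    link_chars: int  # how many characters of that text sit inside links


def cluster_blocks(blocks):
    """Group blocks with identical signatures (a stand-in for 'similar structures')."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.tags].append(block)
    return list(clusters.values())


def link_ratio(cluster):
    """Fraction of the cluster's text that is link text; 1.0 if there is no text."""
    total = sum(len(b.text) for b in cluster)
    linked = sum(b.link_chars for b in cluster)
    return linked / total if total else 1.0


def informative_clusters(blocks, max_link_ratio=0.5, min_chars=40):
    """Keep clusters with enough text that is not mostly links (assumed heuristics)."""
    kept = []
    for cluster in cluster_blocks(blocks):
        chars = sum(len(b.text) for b in cluster)
        if chars >= min_chars and link_ratio(cluster) <= max_link_ratio:
            kept.append(cluster)
    return kept


# Toy page: a menu, two article teasers with the same structure, and a footer.
page = [
    Block(("a", "a", "a"), "Home News Sports", 16),
    Block(("h2", "p"), "Quake hits coast. Officials report minor damage.", 0),
    Block(("h2", "p"), "Markets rally. Stocks closed higher on Friday.", 0),
    Block(("small",), "Copyright 2007", 0),
]
result = informative_clusters(page)
```

On this toy input, the menu is rejected because all of its text is link text, the footer because it is too short, and the two teasers survive as a single cluster. A wrapper induction module would then build its extraction rules from that cluster only.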