Incremental Web Page Template Detection by Text Segments

IEEE International Workshop on Semantic Computing and Systems. Pub Date: 2008-07-14. DOI: 10.1109/WSCS.2008.17

Yu Wang, Bingxing Fang, Xueqi Cheng, Li Guo, Hongbo Xu

Citations: 4

Abstract

Template detection is important for many applications. Most template detection methods use content repetition as a hint to detect template blocks, which requires a large number of Web pages as input. They therefore usually process Web pages in batches, so a newly crawled page cannot be processed until enough pages have been collected. This consumes a large amount of storage to cache Web pages and causes a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, we do not need to cache any Web pages. Experiments show that our framework consumes less than 7% of the storage required by traditional methods. The data-refresh delay induced by batch processing is also completely eliminated.
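The abstract does not describe the framework's internals, so the following is only a minimal sketch of the general idea of incremental detection by repeated text segments, not the authors' algorithm. It assumes each page has already been split into text segments, and it keeps only a per-segment fingerprint count instead of caching pages. The class name `IncrementalTemplateDetector`, the `min_pages` threshold, and the SHA-1 fingerprint are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


class IncrementalTemplateDetector:
    """Hypothetical sketch: flag text segments that recur across pages.

    Only fingerprint counts are stored; the pages themselves are discarded
    as soon as they have been processed.
    """

    def __init__(self, min_pages=3):
        # Assumed threshold: a segment seen on at least this many pages
        # is treated as template text.
        self.min_pages = min_pages
        # fingerprint -> number of pages that contained the segment
        self.page_counts = defaultdict(int)

    @staticmethod
    def _fingerprint(segment):
        # Normalize whitespace and case so trivial variations still match.
        normalized = " ".join(segment.split()).lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def process(self, segments):
        """Process one page immediately; return (content, template) lists."""
        fingerprints = [self._fingerprint(s) for s in segments]
        # Count each distinct segment once per page.
        for fp in set(fingerprints):
            self.page_counts[fp] += 1
        content, template = [], []
        for seg, fp in zip(segments, fingerprints):
            if self.page_counts[fp] >= self.min_pages:
                template.append(seg)
            else:
                content.append(seg)
        return content, template


detector = IncrementalTemplateDetector(min_pages=2)
first = detector.process(["Home | About | Contact", "Story one"])
second = detector.process(["Home | About | Contact", "Story two"])
print(first)
print(second)
```

In this sketch the first page from a site cannot be separated from its template, because no repetition has been observed yet; a real incremental system would need some way to handle that warm-up period, which the abstract does not specify.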



