Incremental Web Page Template Detection by Text Segments

IEEE International Workshop on Semantic Computing and Systems. Pub Date: 2008-07-14. DOI: 10.1109/WSCS.2008.17

Yu Wang, Bingxing Fang, Xueqi Cheng, Li Guo, Hongbo Xu


Citations: 4

Abstract

Template detection is important for many applications. Most template detection methods use content repetition as the cue for identifying template blocks, so they require a large number of Web pages as input. As a result, they usually process pages in batches, and a newly crawled page cannot be processed until enough pages have been collected. Caching those pages consumes a large amount of storage and introduces a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, we do not need to cache any Web page. Experiments show that our framework consumes less than 7% of the storage required by traditional methods, and the data-refresh delay induced by batch processing is completely eliminated.
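The abstract does not spell out the algorithm, so the following is only a rough sketch of what incremental processing could look like, not the authors' method. It assumes each page has already been split into text segments, and it flags a segment as template when it recurs in a large enough share of the pages seen so far for a site. The class name `IncrementalTemplateDetector` and the parameters `min_pages` and `min_ratio` are hypothetical. Only per-segment counters are kept, so no page needs to be cached.

```python
import hashlib
from collections import defaultdict


class IncrementalTemplateDetector:
    """Hypothetical sketch: flag text segments that recur across a site's pages."""

    def __init__(self, min_pages=3, min_ratio=0.5):
        self.min_pages = min_pages    # assumed warm-up: pages needed before flagging
        self.min_ratio = min_ratio    # assumed share of pages a template segment must appear in
        self.pages_seen = defaultdict(int)      # site -> number of pages processed
        self.segment_counts = defaultdict(int)  # (site, digest) -> pages containing the segment

    @staticmethod
    def _digest(segment):
        # Normalize whitespace and case so trivially different copies hash alike.
        normalized = " ".join(segment.split()).lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def process(self, site, segments):
        """Update the counters with one page and return (template, content) segments."""
        self.pages_seen[site] += 1
        digests = {}
        for seg in segments:
            if seg.strip():
                # Count each distinct segment once per page.
                digests.setdefault(self._digest(seg), seg)
        for d in digests:
            self.segment_counts[(site, d)] += 1

        total = self.pages_seen[site]
        template, content = [], []
        for d, seg in digests.items():
            ratio = self.segment_counts[(site, d)] / total
            if total >= self.min_pages and ratio >= self.min_ratio:
                template.append(seg)
            else:
                content.append(seg)
        return template, content


# Usage: pages are handled one at a time, in crawl order.
detector = IncrementalTemplateDetector(min_pages=3, min_ratio=0.6)
pages = [
    ["Home | About | Contact", "Story one", "Copyright Example"],
    ["Home | About | Contact", "Story two", "Copyright Example"],
    ["Home | About | Contact", "Story three", "Copyright Example"],
]
results = [detector.process("example.com", page) for page in pages]
first_template, first_content = results[0]
last_template, last_content = results[-1]
```

In this sketch the first pages of a site are returned entirely as content because too few pages have been seen to judge repetition; a real system would need some policy for revisiting those early decisions, which the abstract does not describe.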
