Efficient web change monitoring with page digest

David J. Buttler, D. Rocco, Ling Liu
Published in: WWW Alt. '04 (2004-05-19)
DOI: https://doi.org/10.1145/1013367.1013533
Citations: 11

Abstract

The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper we describe an approach to building a system that efficiently monitors changes to Web documents. The paper makes three main contributions. First, we present a coherent framework that captures different characteristics of Web documents. The system uses the Page Digest encoding to provide comprehensive monitoring of content, structure, and other interesting properties of Web documents. Second, the Page Digest encoding improves the performance of individual page monitors through mechanisms such as short-circuit evaluation, linear-time algorithms for document and structure similarity, and data size reduction. Finally, we develop a collection of sentinel grouping techniques based on the Page Digest encoding that reduce redundant processing in large-scale monitoring systems by grouping similar monitoring requests together. We examine how effective these techniques are over a wide range of parameters and observe an order-of-magnitude speedup over existing Web-based information monitoring systems.
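
The two performance ideas named in the abstract, grouping similar monitoring requests and short-circuit evaluation, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sentinel` type, the keyword check, and the `fetch` callback are hypothetical, and a plain SHA-256 hash stands in for the Page Digest encoding, whose actual format is not described here.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Sentinel(NamedTuple):
    """Hypothetical monitoring request: tell `owner` when `keyword` is on `url`."""
    owner: str
    url: str
    keyword: str


def digest(text: str) -> str:
    # Stand-in fingerprint (assumption): SHA-256 of the page text,
    # NOT the paper's Page Digest encoding.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_by_url(sentinels: List[Sentinel]) -> Dict[str, List[Sentinel]]:
    # Grouping: all requests that watch the same page share one fetch.
    groups: Dict[str, List[Sentinel]] = defaultdict(list)
    for s in sentinels:
        groups[s.url].append(s)
    return dict(groups)


def check_all(
    sentinels: List[Sentinel],
    fetch: Callable[[str], str],
    last_digest: Dict[str, str],
) -> List[str]:
    """Return the owners to notify; updates `last_digest` in place."""
    notified: List[str] = []
    for url, group in group_by_url(sentinels).items():
        page = fetch(url)  # one fetch per group, not per sentinel
        d = digest(page)
        if last_digest.get(url) == d:
            continue  # short circuit: page unchanged, skip every sentinel
        last_digest[url] = d  # a first observation counts as a change
        notified.extend(s.owner for s in group if s.keyword in page)
    return notified


# Usage with an in-memory "web" so the sketch runs without network access.
pages = {"http://example.org/a": "price: 10", "http://example.org/b": "news"}
calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return pages[url]


sentinels = [
    Sentinel("alice", "http://example.org/a", "price"),
    Sentinel("bob", "http://example.org/a", "price"),
    Sentinel("carol", "http://example.org/b", "sports"),
]
state: Dict[str, str] = {}
first = check_all(sentinels, fake_fetch, state)   # both pages are new
second = check_all(sentinels, fake_fetch, state)  # nothing changed
```

In this sketch three sentinels cost two fetches per round rather than three, and the second round does no per-sentinel work because both fingerprints are unchanged.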