{"title":"有效的网页变化监测与页面摘要","authors":"David J. Buttler, D. Rocco, Ling Liu","doi":"10.1145/1013367.1013533","DOIUrl":null,"url":null,"abstract":"The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper we describe an approach for building a system for efficiently monitoring changes to Web documents. This paper has three main contributions. First, we present a coherent framework that captures different characteristics of Web documents. The system uses the Page Digest encoding to provide a comprehensive monitoring system for content, structure, and other interesting properties of Web documents. Second, the Page Digest encoding enables improved performance for individual page monitors through mechanisms such as short-circuit evaluation, linear time algorithms for document and structure similarity, and data size reduction. Finally, we develop a collection of sentinel grouping techniques based on the Page Digest encoding to reduce redundant processing in large-scale monitoring systems by grouping similar monitoring requests together. We examine how effective these techniques are over a wide range of parameters and have seen an order of magnitude speed up over existing Web-based information monitoring systems.","PeriodicalId":409891,"journal":{"name":"WWW Alt. '04","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Efficient web change monitoring with page digest\",\"authors\":\"David J. Buttler, D. Rocco, Ling Liu\",\"doi\":\"10.1145/1013367.1013533\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper we describe an approach for building a system for efficiently monitoring changes to Web documents. This paper has three main contributions. First, we present a coherent framework that captures different characteristics of Web documents. The system uses the Page Digest encoding to provide a comprehensive monitoring system for content, structure, and other interesting properties of Web documents. Second, the Page Digest encoding enables improved performance for individual page monitors through mechanisms such as short-circuit evaluation, linear time algorithms for document and structure similarity, and data size reduction. Finally, we develop a collection of sentinel grouping techniques based on the Page Digest encoding to reduce redundant processing in large-scale monitoring systems by grouping similar monitoring requests together. We examine how effective these techniques are over a wide range of parameters and have seen an order of magnitude speed up over existing Web-based information monitoring systems.\",\"PeriodicalId\":409891,\"journal\":{\"name\":\"WWW Alt. '04\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"WWW Alt. '04\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1013367.1013533\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"WWW Alt. '04","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1013367.1013533","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper we describe an approach for building a system for efficiently monitoring changes to Web documents. This paper has three main contributions. First, we present a coherent framework that captures different characteristics of Web documents. The system uses the Page Digest encoding to provide a comprehensive monitoring system for content, structure, and other interesting properties of Web documents. Second, the Page Digest encoding enables improved performance for individual page monitors through mechanisms such as short-circuit evaluation, linear time algorithms for document and structure similarity, and data size reduction. Finally, we develop a collection of sentinel grouping techniques based on the Page Digest encoding to reduce redundant processing in large-scale monitoring systems by grouping similar monitoring requests together. We examine how effective these techniques are over a wide range of parameters and have seen an order of magnitude speed up over existing Web-based information monitoring systems.