A novel web page duplication detection framework

2009 IEEE International Conference on Network Infrastructure and Digital Content
Pub Date: 2009-12-31
DOI: 10.1109/ICNIDC.2009.5360814

Zhongming Han, Dagao Duan, Hongzhi Liu, Jianzhi Sun


Citations: 0

Abstract

The Internet contains many redundant web pages. Based on tag statistics and text similarity comparison, we present a novel multilayer framework for detecting duplicated web pages. We propose two algorithms for detecting similar text paragraphs and implement the framework. The experimental results show that our approach achieves high performance, indicating that duplicated web pages can be detected efficiently using only tag statistics and text comparison.
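The abstract does not spell out the framework's layers or its two paragraph-detection algorithms, so the following is only a minimal sketch of the general idea of combining tag statistics with text comparison, not the authors' method. It assumes a two-layer check: a cheap first layer compares HTML tag counts with cosine similarity, and a second layer compares visible text with Jaccard similarity over word shingles. The function names, the shingle size `k`, and both thresholds are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class _TagTextCollector(HTMLParser):
    """Collects start-tag counts and non-empty text chunks from HTML."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def profile(html):
    """Return (tag_counts, text) for one page."""
    collector = _TagTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tag_counts, " ".join(collector.text_parts)


def tag_similarity(a, b):
    """Cosine similarity between two tag-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def text_similarity(a, b, k=3):
    """Jaccard similarity over k-word shingles (k is an assumed value)."""

    def shingles(text):
        words = text.lower().split()
        if len(words) < k:
            return {tuple(words)} if words else set()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def is_duplicate(html_a, html_b, tag_threshold=0.9, text_threshold=0.8):
    """Two-layer check; both thresholds are hypothetical, not from the paper."""
    tags_a, text_a = profile(html_a)
    tags_b, text_b = profile(html_b)
    # Layer 1: discard pairs whose tag statistics already differ.
    if tag_similarity(tags_a, tags_b) < tag_threshold:
        return False
    # Layer 2: compare the visible text of the remaining candidates.
    return text_similarity(text_a, text_b) >= text_threshold
```

Running the tag-statistics layer first means the more expensive text comparison is only applied to pages that already look structurally alike, which is one plausible reading of "multilayer" in the abstract.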
