Zhongming Han, Dagao Duan, Hongzhi Liu, Jianzhi Sun
{"title":"一种新的网页重复检测框架","authors":"Zhongming Han, Dagao Duan, Hongzhi Liu, Jianzhi Sun","doi":"10.1109/ICNIDC.2009.5360814","DOIUrl":null,"url":null,"abstract":"There are a lot of redundant web pages on Internet. Based on tag statistic and text similarity comparison, we present a novel multilayer framework for detecting duplicated web pages in this paper. We propose two similarity text paragraphs detection algorithms and implement our framework. The experimental results show that our approach achieves high performance, which means that duplicated web pages can be efficiently detected simply by tag statistic and text comparison.","PeriodicalId":127306,"journal":{"name":"2009 IEEE International Conference on Network Infrastructure and Digital Content","volume":"12 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A novel web page duplication detection framework\",\"authors\":\"Zhongming Han, Dagao Duan, Hongzhi Liu, Jianzhi Sun\",\"doi\":\"10.1109/ICNIDC.2009.5360814\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"There are a lot of redundant web pages on Internet. Based on tag statistic and text similarity comparison, we present a novel multilayer framework for detecting duplicated web pages in this paper. We propose two similarity text paragraphs detection algorithms and implement our framework. The experimental results show that our approach achieves high performance, which means that duplicated web pages can be efficiently detected simply by tag statistic and text comparison.\",\"PeriodicalId\":127306,\"journal\":{\"name\":\"2009 IEEE International Conference on Network Infrastructure and Digital Content\",\"volume\":\"12 4\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 IEEE International Conference on Network Infrastructure and Digital Content\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICNIDC.2009.5360814\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Conference on Network Infrastructure and Digital Content","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICNIDC.2009.5360814","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
There are a lot of redundant web pages on Internet. Based on tag statistic and text similarity comparison, we present a novel multilayer framework for detecting duplicated web pages in this paper. We propose two similarity text paragraphs detection algorithms and implement our framework. The experimental results show that our approach achieves high performance, which means that duplicated web pages can be efficiently detected simply by tag statistic and text comparison.