Yanxu Zhu, Gang Yin, Huaimin Wang, Dian-xi Shi, Xiang Rao, Lin Yuan
2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, published 2011-10-10. DOI: https://doi.org/10.1109/CyberC.2011.15
Efficient Approach for Repeated Patterns Mining Based on Indent Shape of HTML Documents
Mining repeated patterns from HTML documents is a key step in Web-based data mining and knowledge extraction. Many web crawling applications need efficient repeated-pattern mining techniques to generate their wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their performance remains a challenge for practical web crawling applications. In this paper, we propose an efficient approach for mining repeated patterns based on the indent shape of an HTML document. The indent shape is a novel and simple model of an HTML document in which tandem repeated waves are strongly associated with the repeated patterns to be detected. The tandem repeated waves are identified by scanning the indent shape with a horizontal indent-line from bottom to top and filtering out wave segments with low self-similarity. The boundaries of the HTML code corresponding to the repeated patterns can then be identified and easily transformed into formally defined regular expressions. Extensive experiments on two practical data sets retrieved from the Internet show that our approach is significantly more efficient than existing approaches, and that its precision is also generally better.
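The abstract does not give the algorithm's details, so the following is only a rough Python sketch of the general idea under assumed definitions, not the authors' implementation. It assumes the indent shape is the sequence of tag nesting depths in document order, that a "wave" is a tag lying on the indent-line together with the deeper tags that follow it, that the scan runs from the deepest level to the shallowest, and that self-similarity can be approximated with `difflib.SequenceMatcher`. The names `IndentShape`, `wave_runs`, `tandem_repeats`, and `mine`, as well as the `threshold` and `min_repeats` parameters, are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndentShape(HTMLParser):
    """Records (depth, tag) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def indent_shape(html):
    parser = IndentShape()
    parser.feed(html)
    parser.close()
    return parser.shape


def wave_runs(shape, level):
    """Cut the shape with a horizontal line at `level` (assumed definition).

    A wave starts at a tag exactly on the line and includes the deeper tags
    that follow it. A tag shallower than the line ends the current run, so
    only waves within one run are adjacent to each other.
    """
    runs, run, wave = [], [], None
    for depth, tag in shape:
        if depth == level:               # a new wave starts on the line
            if wave:
                run.append(wave)
            wave = [(0, tag)]
        elif depth > level:
            if wave is not None:         # deeper tag joins the open wave
                wave.append((depth - level, tag))
        else:                            # shallower tag: the run ends here
            if wave:
                run.append(wave)
            if run:
                runs.append(run)
            run, wave = [], None
    if wave:
        run.append(wave)
    if run:
        runs.append(run)
    return runs


def tandem_repeats(run, threshold=0.8, min_repeats=2):
    """Group consecutive waves that are similar to their predecessor."""
    groups, group = [], run[:1]
    for prev, cur in zip(run, run[1:]):
        if SequenceMatcher(None, prev, cur).ratio() >= threshold:
            group.append(cur)
        else:
            if len(group) >= min_repeats:
                groups.append(group)
            group = [cur]
    if len(group) >= min_repeats:
        groups.append(group)
    return groups


def mine(html, threshold=0.8, min_repeats=2):
    """Return (level, waves) pairs for each tandem repeat that is found."""
    shape = indent_shape(html)
    max_level = max((d for d, _ in shape), default=0)
    found = []
    for level in range(max_level, -1, -1):   # assumed order: deepest first
        for run in wave_runs(shape, level):
            for group in tandem_repeats(run, threshold, min_repeats):
                found.append((level, group))
    return found
```

On a small page with three `<li><a>` items under a single `<ul>`, this sketch reports one group of three waves at depth 3. Turning the matched boundaries into regular expressions, as the abstract describes, is not covered here.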