Catch Crawler: Automatic Web Information Extractor Using Style Sheet
Kwangcheol Shin, Geun-Sik Jo
2008 IEEE International Workshop on Semantic Computing and Applications, published 2008-07-10
DOI: 10.1109/IWSCA.2008.23 (https://doi.org/10.1109/IWSCA.2008.23)
Citation count: 3
Abstract
Web mining tasks require datasets that are free from noise. Commercial Web pages generally contain a lot of noise that is not relevant to the main content, such as navigation panels, advertisements, copyright notices, and other service links. In this paper, we present a new automatic Web information extractor called 'Catch Crawler', which uses style sheets to extract data of interest from a target site. Style sheets are generally used to give the pages of a commercial Web site a uniform presentation. To run Catch Crawler, a user indicates the data area of interest by clicking on that data in a Web page. Catch Crawler then automatically identifies the style sheet class of the selected data and generates a dataset from the whole Web site by collecting the content that uses the same style sheet class. Experimental results show that our approach to extracting noise-free Web data achieves over 90% accuracy on average.
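
The class-based extraction step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_by_class`, the class name `article-body`, and the sample pages are all hypothetical, the user's click is replaced by passing the class name directly, and the HTML is assumed to close every non-void tag.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given style sheet class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1                 # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1                  # a new matching element starts
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:                 # the matching element just closed
            self.results.append(" ".join(" ".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_by_class(html, target_class):
    """Return the whitespace-normalized text of each element with target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical pages from one site that share a style sheet (assumed data).
pages = {
    "/news/1": '<div class="nav">Home | About</div>'
               '<div class="article-body"><p>First story.</p></div>'
               '<div class="footer">Copyright</div>',
    "/news/2": '<div class="nav">Home | About</div>'
               '<div class="article-body lead"><p>Second<br>story.</p></div>',
}

# "article-body" stands in for the class the user would select by clicking.
dataset = {url: extract_by_class(html, "article-body")
           for url, html in pages.items()}
```

Here the navigation bar and footer are skipped because they carry different classes, while the content marked with the selected class is collected from every page. A real crawler would also have to fetch pages, follow links, and tolerate malformed markup, none of which this sketch attempts.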