Proposed Document Frequency technique for minimizing dataset in Web crawler
A. Sarhan, Ghada M. Hamissa, Heba E. Elbehiry
2015 Tenth International Conference on Computer Engineering & Systems (ICCES), published 2015-12-01
DOI: 10.1109/ICCES.2015.7393008 (https://doi.org/10.1109/ICCES.2015.7393008)
The explosive growth in the number of web pages has created problems for the search process. One is that general-purpose search engines often return too many irrelevant results when users search for specific information on a given topic. Another is the massive increase in the number of pages that Web search systems must index. This research addresses both problems with a two-step approach to Web crawling. The first step is feature selection on the datasets used: we propose a feature selection algorithm that applies the Document Frequency technique to each term within a category. The second step is Web page classification, for which two well-known techniques are used: (i) Support Vector Machine and (ii) Naïve Bayes classifier. We conclude that the proposed algorithm, using the Document Frequency technique, reduces redundancy during feature selection and increases accuracy during Web page classification. A complete evaluation, implemented in Java, demonstrates the effectiveness of the proposed algorithm.
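The abstract does not give the details of the proposed algorithm, so the following is only a minimal sketch of generic per-category Document Frequency feature selection, not the authors' method. It is written in Python for brevity, although the paper's evaluation was done in Java. The function names, the `min_df` threshold, the `top_k` cutoff, and the toy corpus are all assumptions made for illustration, and the input is assumed to be already tokenized with stop words removed.

```python
from collections import Counter, defaultdict


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def select_features(labeled_docs, min_df=2, top_k=3):
    """Per category, keep up to top_k terms whose in-category DF >= min_df.

    min_df and top_k are hypothetical parameters; the source does not
    state which thresholds the proposed algorithm uses.
    """
    by_category = defaultdict(list)
    for category, tokens in labeled_docs:
        by_category[category].append(tokens)

    selected = {}
    for category, docs in by_category.items():
        df = document_frequency(docs)
        frequent = [(term, n) for term, n in df.items() if n >= min_df]
        # Highest DF first; ties broken alphabetically for determinism.
        frequent.sort(key=lambda item: (-item[1], item[0]))
        selected[category] = [term for term, _ in frequent[:top_k]]
    return selected


# Toy corpus (invented for illustration): (category, tokens) pairs.
corpus = [
    ("sports", ["team", "won", "match"]),
    ("sports", ["match", "ended", "draw"]),
    ("sports", ["team", "training", "match"]),
    ("finance", ["bank", "raised", "interest", "rates"]),
    ("finance", ["interest", "rates", "affect", "bank"]),
]

features = select_features(corpus, min_df=2, top_k=3)
print(features)
```

Terms that appear in only one document of a category ("won", "draw", "raised", and so on) are dropped, which shrinks the vocabulary passed on to a classifier such as a Support Vector Machine or Naïve Bayes. How the reduced feature set affects accuracy depends on the data and thresholds and is not demonstrated by this sketch.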