Research on text mining algorithm based on focused crawler
Authors: Qiusheng Zhang, M. Lin, J. Jun, Xingyun Zhang
Published in: 2017 12th International Conference on Computer Science and Education (ICCSE), 2017-08-01
DOI: 10.1109/ICCSE.2017.8085535 (https://doi.org/10.1109/ICCSE.2017.8085535)
The Internet has become the world's largest information repository. With the explosive growth of text data on the web, the drawbacks of slow web page acquisition and updating and of low precision have become more obvious. This paper proposes a text mining algorithm based on a focused crawler. A topic crawler algorithm classifies and integrates web pages by topic as far as possible, which greatly improves the retrieval of web pages. On this basis, a naive Bayes algorithm is applied to carry out text mining on the web data. The experimental results show that the algorithm is feasible and achieves higher recall and precision on web pages.
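The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a multinomial naive Bayes classifier with add-one smoothing as the relevance judge and a best-first crawl over a small in-memory link graph in place of real HTTP fetching. All names (`NaiveBayes`, `focused_crawl`, `threshold`) and the toy pages are hypothetical and chosen for illustration.

```python
import heapq
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior for every label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def prob(self, text, label):
        """Posterior probability of `label`, normalised over all labels."""
        scores = self.log_scores(text)
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        return exp[label] / sum(exp.values())


def focused_crawl(pages, links, seed, model, topic, threshold=0.5):
    """Best-first crawl of an in-memory graph; returns on-topic page ids.

    Out-links are followed only from pages judged on-topic, and the
    frontier is ordered by the relevance of the linking page.
    """
    frontier = [(-1.0, seed)]  # max-heap emulated with negated priorities
    seen = {seed}
    relevant = []
    while frontier:
        _, page = heapq.heappop(frontier)
        p = model.prob(pages[page], topic)
        if p < threshold:
            continue  # off-topic: do not expand this page
        relevant.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-p, nxt))
    return relevant


# Toy data, invented purely for illustration.
model = NaiveBayes().fit(
    ["crawler index web search", "web page crawler link",
     "football match goal", "team score football league"],
    ["web", "web", "sport", "sport"],
)
pages = {
    "a": "web crawler search",
    "b": "football league goal",
    "c": "page link index",
    "d": "crawler web page",
}
links = {"a": ["b", "c"], "b": ["d"], "c": []}
result = focused_crawl(pages, links, "a", model, "web")
print(result)
```

In this toy graph, page `d` is on-topic but is linked only from the off-topic page `b`, so the crawl never reaches it. This illustrates the usual trade-off in focused crawling: pruning off-topic branches saves fetches but can cost recall, which is why the choice of `threshold` matters.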