Design and Implementation of Web Crawler Based on 'Internet +' Data Automatic Extraction
Lulu Zhang, Junru Li, Dacheng Feng, Junjie Sun
2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), published 2023-01-06
DOI: 10.1109/ICCECE58074.2023.10135210 (https://doi.org/10.1109/ICCECE58074.2023.10135210)
Citation count: 0
Abstract
The rise of the “Internet +” strategy has broken down barriers to data and information. Web crawlers are widely used for data acquisition and data analysis across the massive volume of “Internet +” information. Taking the “IMDb Top 250 movies” list as the target and using crawler technology based on the Python language, this paper explains the four steps of a web crawler in detail, compares three web page parsing methods (BeautifulSoup, regular expressions (Re), and XPath), and completes the crawl of the target data. The experimental results show that Re is the fastest at parsing data, that BeautifulSoup is the best in terms of web page parsing logic, and that XPath is the most suitable for overall use.
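The three parsing methods named in the abstract can be illustrated side by side. The following is a minimal sketch, not the authors' code: the HTML fragment, its class names, the movie titles, and the function names are all hypothetical, and the real IMDb markup will differ. It assumes the third-party packages `beautifulsoup4` and `lxml` are installed for the second and third functions.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
# The tag layout and class names are assumptions, not real IMDb markup.
SAMPLE_HTML = """
<ul class="movies">
  <li class="movie"><span class="title">Movie A</span><span class="rating">9.0</span></li>
  <li class="movie"><span class="title">Movie B</span><span class="rating">8.5</span></li>
</ul>
"""


def parse_with_re(html):
    """Extract (title, rating) pairs with a regular expression."""
    pattern = re.compile(
        r'<span class="title">(.*?)</span>\s*<span class="rating">([\d.]+)</span>',
        re.S,
    )
    return [(title, float(rating)) for title, rating in pattern.findall(html)]


def parse_with_bs4(html):
    """Extract (title, rating) pairs by walking the tree with BeautifulSoup."""
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="movie"):
        title = item.find("span", class_="title").get_text(strip=True)
        rating = float(item.find("span", class_="rating").get_text(strip=True))
        rows.append((title, rating))
    return rows


def parse_with_xpath(html):
    """Extract (title, rating) pairs with XPath queries via lxml."""
    from lxml import etree  # third-party: lxml

    tree = etree.HTML(html)
    rows = []
    for item in tree.xpath('//li[@class="movie"]'):
        title = item.xpath('./span[@class="title"]/text()')[0].strip()
        rating = float(item.xpath('./span[@class="rating"]/text()')[0])
        rows.append((title, rating))
    return rows
```

Calling each function on `SAMPLE_HTML` should yield the same list of pairs, so the three approaches can be swapped freely. Wrapping the calls in `timeit.timeit` is one way to compare their speed on a larger page, although such timings depend on the markup and the machine.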