Computer Vision-based Web Scraping for Internet Forums
Eric C. Dallmeier
2021 7th International Conference on Optimization and Applications (ICOA), published 2021-05-19
DOI: 10.1109/ICOA51614.2021.9442634 (https://doi.org/10.1109/ICOA51614.2021.9442634)
Citations: 3
Abstract
Given the amount of data available on websites, there is a growing need to transform this data from a human-understandable format, its visual representation, into a computer-understandable format, e.g. entries in a database. The web scraping approaches published over the last two decades share a drawback: they all rely, to some degree, on the structure and existence of the underlying Hypertext Markup Language (HTML) or Cascading Style Sheets (CSS). To reduce this dependency and move toward a more human-like understanding of websites, this short paper presents a research project that proposes a web scraping approach based solely on the visual representation of a given website. For this purpose, existing approaches from the domains of Document Layout Analysis and Optical Character Recognition (OCR) are taken into consideration. This short paper provides relevant background on the fields involved and proposes a methodology by which the suggested approach can be implemented and tested in further work.