An Approach of Web Scraping on News Website based on Regular Expression
Achmad Maududie, Windi Eka Yulia Retnani, Muhamat Abdul Rohim
2018 2nd East Indonesia Conference on Computer and Information Technology (EIConCIT), published 2018-11-01
DOI: 10.1109/EIConCIT.2018.8878550 (https://doi.org/10.1109/EIConCIT.2018.8878550)
Citation count: 7
Abstract
The rapid growth in the number of news documents creates a new problem when a news website does not provide a download service. This paper describes an approach for obtaining the title, publication date, author, clean article text, and URL of news articles from the HTML pages of three news websites, i.e., Detik, Tribunnews, and Liputan6, without a manual copy-and-paste process. The approach consists of three steps: analyzing the news website structure, constructing regular expression (Regex) patterns, and implementing the patterns as a set of rules in web scraping. In the experiment, each news website had its own pattern for the article link, article title, article author, and publication date. For the phase that extracts the clean article text, two kinds of pattern were used: a content pattern (for extracting the original article text) and a filter pattern (for eliminating non-news elements). On these three news websites, the non-news elements consist of text advertisements, video advertisements, links, images, and scripts, with a different pattern for each website. After all necessary patterns were generated and implemented as a set of rules, the web scraping module produced very good article extraction results on Detik and Tribunnews, with recall = 1, precision = 1, and F-Measure = 100%, while Liputan6 scored slightly lower, with recall = 0.95, precision = 0.95, and F-Measure = 95%. These results indicate that the approach is a simple and straightforward way to extract news articles, each consisting of the title, publication date, author, article text, and URL.
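The pattern-based extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sample HTML, the tag and class names, the regular expressions, and the function names `extract_article` and `f_measure` are all assumptions made for illustration; the real patterns for Detik, Tribunnews, and Liputan6 are not given in the abstract. The F-measure is assumed to be the usual balanced one (the harmonic mean of precision and recall).

```python
import re

# Hypothetical page: the markup below is invented for illustration and does
# not reproduce the structure of any of the three news websites.
SAMPLE_HTML = """
<html><body>
<h1 class="title">Flood Hits City Center</h1>
<span class="author">Jane Doe</span>
<span class="date">2018-11-01</span>
<article class="content">
<p>Heavy rain fell overnight.</p>
<script>trackView();</script>
<div class="ads">Buy now!</div>
<a href="/other">Read also: Storm warning</a>
<p>Roads were closed by morning.</p>
<img src="flood.jpg">
</article>
</body></html>
"""

# One pattern per metadata field, each with a single capture group.
FIELD_PATTERNS = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}

# Content pattern: captures the raw article body.
CONTENT_PATTERN = re.compile(r'<article class="content">(.*?)</article>', re.S)

# Filter patterns: non-news elements removed from the captured body.
FILTER_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.S | re.I),  # scripts
    re.compile(r'<div class="ads">.*?</div>', re.S),    # text advertisements
    re.compile(r"<a\b.*?</a>", re.S | re.I),            # links with their text
    re.compile(r"<img\b[^>]*>", re.I),                  # images
    re.compile(r"<[^>]+>"),                             # any remaining tags
]


def extract_article(html, url):
    """Apply the field, content, and filter patterns to one HTML page."""
    record = {"url": url}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None

    body_match = CONTENT_PATTERN.search(html)
    body = body_match.group(1) if body_match else ""
    for pattern in FILTER_PATTERNS:
        body = pattern.sub(" ", body)
    # Collapse the whitespace left behind by the removed elements.
    record["text"] = " ".join(body.split())
    return record


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML, "https://example.com/news/1"))
    print(f_measure(0.95, 0.95))
```

Under this assumed definition, equal precision and recall give an F-measure of the same value, which is consistent with the figures reported for Liputan6 (0.95, 0.95, 95%). Keeping the patterns in per-site tables rather than in the extraction function mirrors the finding that every website needs its own set of rules.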