An Approach of Web Scraping on News Website based on Regular Expression

2018 2nd East Indonesia Conference on Computer and Information Technology (EIConCIT), Pub Date: 2018-11-01, DOI: 10.1109/EIConCIT.2018.8878550

Achmad Maududie, Windi Eka Yulia Retnani, Muhamat Abdul Rohim


Citations: 7

Abstract

The rapid growth in the number of news documents creates a new problem when news websites do not provide a download service. This paper describes an approach for extracting the title, publication date, author, clean article text, and URL of news articles from the HTML pages of three news websites (Detik, Tribunnews, and Liputan6) without manual copying and pasting. The approach consists of three steps: analyzing the structure of each news website, constructing regular expression (regex) patterns, and implementing those patterns as a set of rules in a web scraper. In the experiment, each news website required its own patterns for the article link, title, author, and publication date. Extracting the clean article text required two kinds of pattern: a content pattern (for extracting the original article text) and a filter pattern (for eliminating non-news elements). On these three websites, the non-news elements consisted of text advertisements, video advertisements, links, images, and scripts, with different patterns for each website. After all the necessary patterns were generated and implemented as a set of rules, the web scraping module extracted news articles from Detik and Tribunnews with recall = 1, precision = 1, and F-measure = 100%, while Liputan6 scored slightly lower, with recall = 0.95, precision = 0.95, and F-measure = 95%. The results indicate that this approach is a simple and straightforward way to extract a news article record consisting of the title, publication date, author, article text, and URL.
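
To make the two-pattern idea concrete, here is a minimal Python sketch of a content pattern followed by filter patterns, plus the standard F-measure formula. It is an illustration only: the abstract does not list the paper's actual per-site patterns, so the HTML layout, the class names (`article-body`, `ad`), the `end-body` comment marker, and all function names below are hypothetical.

```python
import re

# Hypothetical patterns for an invented page layout. The real patterns
# for Detik, Tribunnews, and Liputan6 are not given in the abstract.
TITLE_PATTERN = re.compile(r"<h1[^>]*>(.*?)</h1>", re.S)

# Content pattern: captures the raw article body.
CONTENT_PATTERN = re.compile(
    r'<div class="article-body">(.*?)</div>\s*<!--\s*end-body\s*-->', re.S
)

# Filter patterns: remove non-news elements inside the captured body.
FILTER_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.S | re.I),  # scripts
    re.compile(r'<div class="ad">.*?</div>', re.S),     # text/video ads
    re.compile(r"<img\b[^>]*>", re.I),                  # images
    re.compile(r"<a\b[^>]*>.*?</a>", re.S | re.I),      # links and their text
]
TAG_PATTERN = re.compile(r"<[^>]+>")  # any remaining markup


def extract_title(html):
    """Return the first <h1> text, or None if the pattern does not match."""
    match = TITLE_PATTERN.search(html)
    return match.group(1).strip() if match else None


def extract_clean_text(html):
    """Apply the content pattern, then every filter pattern, then strip tags."""
    match = CONTENT_PATTERN.search(html)
    if match is None:
        return None
    body = match.group(1)
    for pattern in FILTER_PATTERNS:
        body = pattern.sub(" ", body)
    body = TAG_PATTERN.sub(" ", body)
    return " ".join(body.split())  # collapse runs of whitespace


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


SAMPLE_HTML = """
<html><body>
<h1 class="title">Sample headline</h1>
<div class="article-body">
<p>First paragraph of the story.</p>
<div class="ad">Buy now!</div>
<script>track();</script>
<p>Second paragraph.<img src="x.jpg"> <a href="/more">Read also</a></p>
</div><!-- end-body -->
</body></html>
"""

if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
    print(extract_clean_text(SAMPLE_HTML))
    print(f_measure(0.95, 0.95))
```

Separating the content pattern from the filter patterns means that only the filter list needs to change when a site adds a new kind of advertisement or widget. Regex-based rules like these depend on the exact page markup, so each site needs its own set and the rules must be revisited whenever the layout changes.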

