An Approach of Web Scraping on News Website based on Regular Expression

Achmad Maududie, Windi Eka Yulia Retnani, Muhamat Abdul Rohim
Published in: 2018 2nd East Indonesia Conference on Computer and Information Technology (EIConCIT)
Publication date: 2018-11-01
DOI: 10.1109/EIConCIT.2018.8878550 (https://doi.org/10.1109/EIConCIT.2018.8878550)
Citations: 7

Abstract

The rapid growth of online news documents creates a problem when a news website does not provide a download service. This paper describes an approach for obtaining the title, publication date, author, clean article text, and URL of news articles from the HTML pages of three news websites (Detik, Tribunnews, and Liputan6) without manual copying and pasting. The approach consists of three steps: analyzing the structure of the news website, constructing regular expression (regex) patterns, and implementing the patterns as a set of rules in a web scraper. In the experiment, each news website required its own patterns for the article link, title, author, and publication date. Extracting the clean article text required two kinds of pattern: a content pattern (for extracting the original article text) and a filter pattern (for eliminating non-news elements). On these three websites, the non-news elements consisted of text advertisements, video advertisements, links, images, and scripts, with different patterns for each website. After all necessary patterns were generated and implemented as a set of rules, the web scraping module extracted news articles from Detik and Tribunnews with recall = 1, precision = 1, and F-measure = 100%, while the results for Liputan6 were slightly lower, with recall = 0.95, precision = 0.95, and F-measure = 95%. The authors conclude that this approach is a simple and straightforward way to extract a news article's title, publication date, author, article text, and URL.
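
The content-pattern and filter-pattern idea can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the sample HTML, the tag and class names, the `RULES` and `FILTERS` tables, and the `extract` helper are hypothetical and do not reproduce the authors' actual patterns or the real markup of Detik, Tribunnews, or Liputan6. The `f_measure` helper assumes the standard harmonic-mean definition (F1), which is consistent with the values reported in the abstract.

```python
import re
from html import unescape

# Hypothetical page; tag and class names are illustrative assumptions,
# not the real markup of any of the three websites.
SAMPLE_HTML = """
<html><body>
<h1 class="title">Example headline</h1>
<span class="author">Jane Doe</span>
<time class="date">2018-11-01</time>
<article class="content">
<p>First paragraph of the story.</p>
<script>trackPageView();</script>
<div class="ad">Buy now!</div>
<p class="related"><a href="/other">Read also: another story</a></p>
<p>Second paragraph.</p>
<img src="photo.jpg">
</article>
</body></html>
"""

# One rule per field; in practice each website would need its own set.
RULES = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "author": re.compile(r'<span class="author">(.*?)</span>', re.S),
    "date": re.compile(r'<time class="date">(.*?)</time>', re.S),
    "content": re.compile(r'<article class="content">(.*?)</article>', re.S),
}

# Filter patterns: non-news elements removed from the captured content.
FILTERS = [
    re.compile(r'<script\b.*?</script>', re.S | re.I),  # scripts
    re.compile(r'<div class="ad">.*?</div>', re.S),     # text advertisements
    re.compile(r'<p class="related">.*?</p>', re.S),    # related-article links
    re.compile(r'<img\b[^>]*>', re.I),                  # images
]
ANY_TAG = re.compile(r'<[^>]+>')


def extract(html, url):
    """Apply the content patterns, then the filter patterns, to one page."""
    record = {"url": url}
    for field, pattern in RULES.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    body = record["content"]
    if body is not None:
        for pattern in FILTERS:
            body = pattern.sub("", body)
        body = ANY_TAG.sub(" ", body)  # drop the remaining markup
        record["content"] = " ".join(unescape(body).split())
    return record


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(extract(SAMPLE_HTML, "https://example.com/news/1"))
    print(f_measure(0.95, 0.95))
```

In this sketch the content pattern captures everything inside the article container, and the filter patterns then strip scripts, advertisements, related links, and images before the remaining tags are dropped. Non-greedy regex matching stops at the first closing tag, so nested containers of the same kind would need more careful patterns; this is one reason such rules have to be written per website.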