Periodic update and automatic extraction of web data for creating a Google Earth based tool

T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana
{"title":"Periodic update and automatic extraction of web data for creating a Google Earth based tool","authors":"T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana","doi":"10.1109/ICACSIS.2015.7415157","DOIUrl":null,"url":null,"abstract":"A lot of tropical disease cases that occurred in Indonesia are reported online in Indonesian news portals. Online news portals are now becoming great sources of information because online news articles are updated frequently. A rule-based, combined with machine learning algorithm, to identify the location of the cases has been developed. In this paper, a complete flow to routinely search, crawl, clean, classify, extract, and integrate the extracted entities into Google Earth is presented. The algorithm is started by searching for Indonesian news articles using a set of selected queries and Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. The articles that discuss about tropical disease cases (classified as positive) are further examined to extract the locution of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in an XML keyhole markup language notation to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into an XML keyhole markup language notation from 5 Web articles. In other words, it takes about 72.40 seconds to process a new page.","PeriodicalId":325539,"journal":{"name":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICACSIS.2015.7415157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Many tropical disease cases occurring in Indonesia are reported in Indonesian online news portals. These portals have become valuable sources of information because their articles are updated frequently. A rule-based algorithm, combined with machine learning, has been developed to identify the locations of the cases. In this paper, a complete pipeline to routinely search, crawl, clean, classify, and extract Web data, and to integrate the extracted entities into Google Earth, is presented. The pipeline starts by searching for Indonesian news articles using a set of selected queries and the Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. Articles that discuss tropical disease cases (classified as positive) are examined further to extract the location of the incidence and to identify the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in Keyhole Markup Language (KML), an XML notation, to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately six minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into KML from five Web articles, i.e., about 72.40 seconds to process each new page.
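The final step the abstract describes, annotating extracted entities in KML so Google Earth can display them, can be illustrated with a minimal sketch. The record fields (location, date, casualties, source URL), the sample case, and the geocoded coordinates below are illustrative assumptions, not the paper's actual schema; only the KML structure itself and its longitude-first coordinate order follow the format's specification.

```python
# A minimal sketch of the annotation step: turning extracted entities
# (location, date of occurrence, casualty count) into a KML file that
# Google Earth can open. Field names and sample data are hypothetical.
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def entities_to_kml(records, path):
    """Write one KML Placemark per extracted disease-case record."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for rec in records:
        pm = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(pm, f"{{{KML_NS}}}name").text = rec["location"]
        ET.SubElement(pm, f"{{{KML_NS}}}description").text = (
            f"Date: {rec['date']}; casualties: {rec['casualties']}; "
            f"source: {rec['url']}"
        )
        point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
        # KML coordinates are "longitude,latitude[,altitude]".
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = (
            f"{rec['lon']},{rec['lat']},0"
        )
    ET.ElementTree(kml).write(path, xml_declaration=True, encoding="UTF-8")


# Hypothetical record, e.g. a dengue case extracted from a news article.
cases = [{
    "location": "Banda Aceh",
    "date": "2015-03-12",
    "casualties": 3,
    "url": "http://example.com/news/dengue-banda-aceh",
    "lat": 5.5483, "lon": 95.3238,
}]
entities_to_kml(cases, "cases.kml")
```

Opening the resulting cases.kml in Google Earth places one pin per extracted case; in the system described, these records would presumably be read from the relational database mentioned in the abstract rather than hard-coded.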