{"title":"Semantic extraction of geographic data from web tables for big data integration","authors":"I. Cruz, Venkat R. Ganesh, Seyed Iman Mirrezaei","doi":"10.1145/2533888.2533939","DOIUrl":null,"url":null,"abstract":"There are millions of web tables with geographic data that are pertinent for big data integration in a variety of domain applications, such as urban sustainability, transportation networks, policy studies, and public health. These tables, however, are heterogeneous in structure, concepts, and metadata. One of the challenges in semantically extracting geographic data is the need to resolve these heterogeneities so as to uncover a conceptual hierarchy, metadata associated with instances, and geographic information---corresponding respectively to ontologies, elements that we call features, and cell values that can be used to identify geographic coordinates. In this paper, we present an architecture with methods to: (1) extract feature-rich web tables; (2) identify features; (3) construct a schema and instances using RDF; (4) perform geocoding. Preliminary experiments led to high accuracy in table identification and feature naming even when compared to manual evaluation.","PeriodicalId":167948,"journal":{"name":"Workshop on Geographic Information Retrieval","volume":"74 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Geographic Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2533888.2533939","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20
Abstract
There are millions of web tables with geographic data that are pertinent for big data integration in a variety of domain applications, such as urban sustainability, transportation networks, policy studies, and public health. These tables, however, are heterogeneous in structure, concepts, and metadata. One of the challenges in semantically extracting geographic data is the need to resolve these heterogeneities so as to uncover a conceptual hierarchy, metadata associated with instances, and geographic information---corresponding respectively to ontologies, elements that we call features, and cell values that can be used to identify geographic coordinates. In this paper, we present an architecture with methods to: (1) extract feature-rich web tables; (2) identify features; (3) construct a schema and instances using RDF; (4) perform geocoding. Preliminary experiments led to high accuracy in table identification and feature naming even when compared to manual evaluation.