GEM: An Efficient Entity Matching Framework for Geospatial Data
Authors: Setu Shah, Venkata Vamsikrishna Meduri, Mohamed Sarwat
Published in: Proceedings of the 29th International Conference on Advances in Geographic Information Systems
Publication date: 2021-11-02
DOI: 10.1145/3474717.3483973 (https://doi.org/10.1145/3474717.3483973)
Citation count: 1
Abstract
Spatial entity matching (EM) is the task of identifying different mentions of the same real-world location. GEM is an end-to-end geospatial EM framework that matches polygon geometry entities in addition to point geometry entities. Its core steps are blocking, feature vector creation, and classification. GEM includes an efficient and lightweight blocking technique, GeoPrune, that uses the geohash encoding mechanism. We repurpose the spatial proximity operators from Apache Sedona to create semantically rich spatial feature vectors. The classification step in GEM is a pluggable component that consumes a unique feature vector and determines whether the geolocations match. We conduct experiments with three classifiers on multiple large-scale geospatial datasets consisting of both spatial and relational attributes. GEM achieves an F-measure of 1.0 on a point × point dataset with 176k total pairs, which is 42% higher than a state-of-the-art spatial EM baseline. It achieves F-measures of 0.966 and 0.993 on the point × polygon dataset with 302M total pairs and the polygon × polygon dataset with 16M total pairs, respectively.
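
To illustrate the general idea of geohash-based blocking mentioned in the abstract, the following is a minimal Python sketch. It is not the GeoPrune algorithm, whose details the abstract does not give. It only shows how encoding points as geohash strings lets a matcher compare records that fall in the same grid cell instead of comparing all pairs. The function names (`geohash_encode`, `block_candidates`), the record layout (an id mapped to a latitude/longitude tuple), and the default precision of 6 characters are assumptions made for illustration.

```python
from collections import defaultdict

# Standard geohash base-32 alphabet (digits plus letters without a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(_BASE32[index])
    return "".join(chars)


def block_candidates(left, right, precision=6):
    """Return (left_id, right_id) pairs whose points share a geohash cell.

    `left` and `right` are hypothetical inputs: dicts mapping a record id
    to a (lat, lon) tuple. Pairs in different cells are pruned.
    """
    cells = defaultdict(list)
    for right_id, (lat, lon) in right.items():
        cells[geohash_encode(lat, lon, precision)].append(right_id)
    pairs = set()
    for left_id, (lat, lon) in left.items():
        for right_id in cells.get(geohash_encode(lat, lon, precision), []):
            pairs.add((left_id, right_id))
    return pairs


if __name__ == "__main__":
    left = {"a": (57.64911, 10.40744)}
    right = {"x": (57.64912, 10.40745), "y": (40.0, -75.0)}
    print(block_candidates(left, right))
```

In this simplified form, two nearby points that straddle a cell boundary land in different cells and are never compared. A practical blocker would need to handle that case, for example by also checking neighboring cells, and would need a separate strategy for polygon geometries.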