GeoFlux: Hands-Off Data Integration Leveraging Join Key Knowledge

Jie Song, Danai Koutra, Murali Mani, H. Jagadish
Published in: Proceedings of the 2018 International Conference on Management of Data, 2018-05-27. DOI: 10.1145/3183713.3193546 (https://doi.org/10.1145/3183713.3193546)
Citation count: 3

Abstract

Data integration is frequently required to obtain the full value of data from multiple sources. Despite extensive research on tools to assist users, data integration remains hard, particularly for users with limited technical proficiency. To address this barrier, we study how much can be done with no user guidance. Our vision is that the user should merely specify two input datasets to be joined and get a meaningful integrated result. It turns out that this vision can be realized if the system can correctly determine the join key, for example based on domain knowledge. We demonstrate this notion by considering a broad domain: socioeconomic data aggregated by geography, a widespread category that accounts for 80% of the data published by government agencies. Intuitively, two such datasets can be integrated by joining on the geographic unit column. Although it sounds easy, this task has many challenges: How can we automatically identify the columns corresponding to geographic units, other dimension variables, and measure variables, respectively? If multiple geographic types exist, which one should be chosen for the join? How can we join tables with idiosyncratic schemas, different geographic units of aggregation, or no aggregation at all? We have developed GeoFlux, a data integration system that handles all these challenges and joins tabular data by automatically aggregating geographic information with a new, advanced crosswalk algorithm. In this demo paper, we give an overview of the system's architecture and its user-friendly interfaces, and then demonstrate via a real-world example that it is general, fully automatic, and easy to use. In the demonstration, we invite users to interact with GeoFlux to integrate more sample socioeconomic data from data.ny.gov.
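The abstract does not spell out how GeoFlux works internally, so the following is only a minimal sketch of the general idea it describes: detect the geographic column in each table, aggregate the finer-grained table up to the coarser unit through a crosswalk, then join on that unit. All names (`find_geo_column`, `aggregate_to_county`, `join_on_county`), the value-matching heuristic, the ZIP-to-county mapping, and the numbers are made up for illustration and are not taken from the paper. The sketch also assumes each ZIP code maps to exactly one county, whereas real crosswalks are often many-to-many and weighted.

```python
from collections import defaultdict

# Hypothetical reference list of county names (a real system would use a gazetteer).
KNOWN_COUNTIES = {"Albany", "Erie", "Kings"}

# Hypothetical crosswalk: finer unit (ZIP code) -> coarser unit (county).
ZIP_TO_COUNTY = {"12203": "Albany", "12208": "Albany", "14201": "Erie", "11201": "Kings"}


def find_geo_column(rows, known_units, threshold=0.8):
    """Return the column whose values most often match known geographic units."""
    best_col, best_share = None, 0.0
    for col in rows[0]:
        share = sum(str(r[col]) in known_units for r in rows) / len(rows)
        if share > best_share:
            best_col, best_share = col, share
    return best_col if best_share >= threshold else None


def aggregate_to_county(rows, geo_col, measure_col, crosswalk):
    """Sum a measure from ZIP level up to county level via the crosswalk."""
    totals = defaultdict(int)
    for r in rows:
        county = crosswalk.get(str(r[geo_col]))
        if county is not None:  # rows with unmapped ZIPs are skipped
            totals[county] += r[measure_col]
    return dict(totals)


def join_on_county(county_rows, county_col, zip_totals, new_col):
    """Inner-join county-level rows with the aggregated ZIP-level totals."""
    return [
        {**r, new_col: zip_totals[r[county_col]]}
        for r in county_rows
        if r[county_col] in zip_totals
    ]


# Illustrative, invented data: one table by county, one by ZIP code.
unemployment = [
    {"area": "Albany", "unemployed": 7000},
    {"area": "Erie", "unemployed": 21000},
    {"area": "Kings", "unemployed": 60000},
]
clinics = [
    {"zip": "12203", "clinics": 3},
    {"zip": "12208", "clinics": 2},
    {"zip": "14201", "clinics": 4},
]

county_col = find_geo_column(unemployment, KNOWN_COUNTIES)
zip_col = find_geo_column(clinics, set(ZIP_TO_COUNTY))
clinic_totals = aggregate_to_county(clinics, zip_col, "clinics", ZIP_TO_COUNTY)
joined = join_on_county(unemployment, county_col, clinic_totals, "clinics")
```

Here the join behaves like an inner join: Kings has no ZIP-level rows in the second table, so it is absent from `joined`. Whether unmatched units should be dropped or kept with missing values is a design choice that the abstract does not address.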