SOPJ: A Scalable Online Provenance Join for Data Integration

Song Zhu, G. Fiameni, Giovanni Simonini, S. Bergamaschi
Published in: 2017 International Conference on High Performance Computing & Simulation (HPCS), 2017-07-01
DOI: 10.1109/HPCS.2017.23 (https://doi.org/10.1109/HPCS.2017.23)
Citations: 5

Abstract

Data integration is a technique for combining different data sources to provide a unified view over them. MOMIS [1] is an open-source data integration framework developed by the DBGroup. The goal of our work is to enable MOMIS to scale out as the number of input data sources increases, without introducing a noticeable performance penalty. In particular, we present a full outer join method capable of efficiently integrating multiple sources at the same time by using data streams and provenance information. To evaluate the scalability of this approach, we developed a join engine built on a distributed data processing framework. Our solution processes input data sources as continuous streams, executes the join operation on the fly, and emits outputs as soon as they are generated. In this way, the join can return partial results before the input streams have been completely received or processed, which optimizes the entire execution. Encouraging results from applying the proposed approach to real datasets close the paper.
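The idea of a streaming full outer join that tracks provenance and emits partial results can be illustrated with a minimal single-process sketch. This is not the SOPJ algorithm or the MOMIS implementation, neither of which is specified in the abstract; the function name `online_full_outer_join`, the record layout `(source, key, attributes)`, and the assumption that each key appears at most once per source are all hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical record layout: (source name, join key, attributes of that record).
Record = Tuple[str, str, dict]
# Hypothetical output layout: (join key, one slot per source, contributing sources).
Joined = Tuple[str, Dict[str, Optional[dict]], Tuple[str, ...]]


def online_full_outer_join(stream: Iterable[Record],
                           sources: List[str]) -> Iterator[Joined]:
    """Join records from several sources on their key as they arrive.

    After every incoming record, the current (possibly partial) joined row
    for that key is emitted, together with the sources that have contributed
    so far. Sources that have not contributed keep a None slot, which gives
    full outer join semantics once the stream ends.
    Assumption: a key appears at most once per source; a repeated key from
    the same source simply overwrites the earlier attributes.
    """
    state: Dict[str, Dict[str, Optional[dict]]] = {}
    for source, key, attrs in stream:
        # Create the row on first sight of the key, with empty slots.
        row = state.setdefault(key, {s: None for s in sources})
        row[source] = attrs
        # Provenance: which sources have supplied data for this key so far.
        provenance = tuple(s for s in sources if row[s] is not None)
        # Emit a copy so that later updates do not alter earlier outputs.
        yield key, dict(row), provenance


if __name__ == "__main__":
    # Interleaved records from two made-up sources "A" and "B".
    demo = [
        ("A", "k1", {"name": "x"}),
        ("B", "k1", {"city": "y"}),
        ("B", "k2", {"city": "z"}),
    ]
    for key, row, provenance in online_full_outer_join(demo, ["A", "B"]):
        print(key, row, provenance)
```

Because each output is produced as soon as a record arrives, a consumer sees a partial row for a key before the other sources have delivered their matching records. A distributed engine would additionally partition the state by key across workers, which this sketch does not attempt.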