A mediation system for continuous spatial queries on a unified schema using Apache Spark

IF 4.2 | Zone 3, Earth Sciences | Q1 Computer Science, Information Systems
Thi Thu Trang Ngo, François Pinet, David Sarramia, Myoung-Ah Kang
Journal: Big Earth Data
Published: 2023-11-09
DOI: 10.1080/20964471.2023.2275854 (https://doi.org/10.1080/20964471.2023.2275854)

Abstract

Recent advances in big and streaming data systems have enabled real-time analysis of data generated by Internet of Things (IoT) systems and sensors in various domains. In this context, many applications require integrating data from several heterogeneous sources, whether streaming or static. Frameworks such as Apache Spark can integrate and process large datasets from different sources. However, these frameworks are hard to use when the data sources are heterogeneous and numerous. To address this issue, we propose a system based on mediation techniques for integrating stream and static data sources. The integration process of our system consists of three main steps: configuration, query expression and query execution. In the configuration step, an administrator designs a mediated schema and defines mappings between the mediated schema and the local data sources. In the query expression step, users express queries on the mediated schema using a customized SQL grammar. Finally, in the query execution step, our system rewrites the query into an optimized Spark application and submits the application to a Spark cluster. The results are continuously returned to users. Our experiments show that our optimizations can reduce query execution time by up to one order of magnitude, making complex streaming and spatial data analysis more accessible.
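
The three steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mapping structure, the relation and source names (observation, kafka_sensors, archive_readings) and the rewrite function are all hypothetical, and the sketch only produces per-source SQL strings. It does not generate a Spark application, handle spatial operators, or apply any optimization.

```python
from dataclasses import dataclass


@dataclass
class SourceMapping:
    """Maps one local source onto a mediated relation (hypothetical structure)."""
    source: str    # local table or stream name (assumed)
    kind: str      # "stream" or "static"
    columns: dict  # mediated attribute -> local column


# Configuration step: an administrator declares the mediated relation
# and how each local source maps onto it. All names are illustrative.
MAPPINGS = {
    "observation": [
        SourceMapping("kafka_sensors", "stream",
                      {"sensor_id": "id", "value": "val", "ts": "event_time"}),
        SourceMapping("archive_readings", "static",
                      {"sensor_id": "sensor", "value": "reading", "ts": "recorded_at"}),
    ]
}


def rewrite(relation, attributes, predicate=None):
    """Rewrite a query on the mediated schema into one SQL string per local source.

    predicate is an optional (attribute, operator, literal) triple.
    """
    queries = []
    for m in MAPPINGS[relation]:
        select = ", ".join(f"{m.columns[a]} AS {a}" for a in attributes)
        sql = f"SELECT {select} FROM {m.source}"
        if predicate is not None:
            attr, op, literal = predicate
            sql += f" WHERE {m.columns[attr]} {op} {literal!r}"
        queries.append((m.kind, sql))
    return queries


# Query expression step: a user asks for sensor_id and value where value > 30,
# phrased only in terms of the mediated schema.
for kind, sql in rewrite("observation", ["sensor_id", "value"], ("value", ">", 30)):
    print(kind, "->", sql)
```

The point of the sketch is that the user's query mentions only mediated attribute names, while the mappings defined at configuration time determine which local columns and sources are actually read.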
Source journal: Big Earth Data (Earth and Planetary Sciences: Computers in Earth Sciences)
CiteScore: 7.40
Self-citation rate: 10.00%
Articles published: 60
Review time: 10 weeks