A mediation system for continuous spatial queries on a unified schema using Apache Spark

IF 3.8 | Zone 3, Earth Sciences | Q1, Computer Science, Information Systems

Big Earth Data Pub Date : 2023-11-09 DOI:10.1080/20964471.2023.2275854

Thi Thu Trang Ngo, François Pinet, David Sarramia, Myoung-Ah Kang


Abstract

Recent advances in big and streaming data systems have enabled real-time analysis of data generated by Internet of Things (IoT) systems and sensors in various domains. In this context, many applications require integrating data from several heterogeneous sources, which may be either streaming or static. Frameworks such as Apache Spark can integrate and process large datasets from different sources. However, these frameworks are hard to use when the data sources are heterogeneous and numerous. To address this issue, we propose a system based on mediation techniques for integrating stream and static data sources. The integration process of our system consists of three main steps: configuration, query expression and query execution. In the configuration step, an administrator designs a mediated schema and defines mappings between the mediated schema and the local data sources. In the query expression step, users express queries on the mediated schema using a customized SQL grammar. In the query execution step, our system rewrites the query into an optimized Spark application and submits the application to a Spark cluster. The results are continuously returned to users. Our experiments show that our optimizations can reduce query execution time by up to one order of magnitude, making complex streaming and spatial data analysis more accessible.
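To make the three steps concrete, here is a minimal sketch in plain Python of how a query on a mediated schema might be unfolded into one query per local source. It is not the authors' system: the relation `observation`, the source names `kafka_weather` and `pg_stations`, all column names, and the `rewrite` helper are hypothetical, and the per-source unfolding shown here stands in for the paper's rewriting and optimization, which the abstract does not detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMapping:
    """Maps one local source onto the mediated relation (all names hypothetical)."""
    source: str    # local table or stream name
    kind: str      # "stream" or "static"
    columns: dict  # mediated attribute -> local column


# Configuration step: the administrator declares the mediated relation
# and how each local source maps onto it.
MEDIATED_SCHEMA = {"observation": ["sensor_id", "temperature", "lon", "lat"]}

MAPPINGS = {
    "observation": [
        SourceMapping("kafka_weather", "stream",
                      {"sensor_id": "dev", "temperature": "temp_c",
                       "lon": "x", "lat": "y"}),
        SourceMapping("pg_stations", "static",
                      {"sensor_id": "station_id", "temperature": "t",
                       "lon": "longitude", "lat": "latitude"}),
    ]
}


def rewrite(relation, attributes, predicate=None):
    """Query execution step (simplified): unfold a mediated query into one
    SELECT per local source, renaming local columns to mediated names."""
    unknown = [a for a in attributes if a not in MEDIATED_SCHEMA[relation]]
    if unknown:
        raise ValueError(f"not in mediated schema: {unknown}")
    queries = []
    for m in MAPPINGS[relation]:
        select = ", ".join(f"{m.columns[a]} AS {a}" for a in attributes)
        sql = f"SELECT {select} FROM {m.source}"
        if predicate is not None:
            attr, op, value = predicate
            # The predicate is pushed down using the local column name.
            sql += f" WHERE {m.columns[attr]} {op} {value}"
        queries.append((m.kind, sql))
    return queries


# Query expression step: the user only names mediated attributes.
plan = rewrite("observation", ["sensor_id", "temperature"],
               ("temperature", ">", 30))
for kind, sql in plan:
    print(kind, "->", sql)
```

In a real deployment each generated query would presumably be turned into a Spark batch or streaming read and the results unioned under the mediated attribute names; the sketch stops at the SQL strings so that it runs without a cluster.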
