SparkGIS: Resource Aware Efficient In-Memory Spatial Query Processing.

Proceedings of the ... ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems : ACM GIS. ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Pub Date: 2017-11-01

Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, Fusheng Wang


Citations: 0

Abstract



Much effort has been devoted to supporting high-performance spatial queries on large volumes of spatial data in distributed spatial computing systems, especially in the MapReduce paradigm. Recent work has focused on extending spatial MapReduce frameworks to leverage the high-performance in-memory distributed processing capabilities of systems such as Spark. However, this performance advantage requires sufficient memory and comprehensive configuration. When these requirements are not met, processing falls back to disk I/O, defeating the purpose of such systems, or in the worst case runs out of memory and the job fails. The problem is aggravated for spatial processing because the underlying in-memory systems are oblivious to spatial data features and characteristics. In this paper we present SparkGIS, an in-memory oriented spatial data querying system that provides high-throughput, low-latency spatial query handling by adapting Apache Spark's distributed processing capabilities. It supports basic spatial queries, including containment, spatial join, and k-nearest neighbor, and allows these to be extended into complex query pipelines. SparkGIS mitigates skew in distributed processing by supporting several dynamic partitioning algorithms suited to a rich set of contemporary application scenarios. Multilevel in-memory indexes (global and local, pre-generated and on-demand) allow SparkGIS to prune input data and apply compute-intensive operations only to the subset of relevant spatial objects. Finally, SparkGIS employs dynamic query rewriting to gracefully manage large spatial query workflows that exceed the available distributed resources. Our comparative evaluation shows that SparkGIS performs on par with contemporary Spark-based platforms for relatively small queries and outperforms them on larger, memory-intensive workflows, owing to dynamic query rewriting and efficient spatial data management.
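The partition-then-prune idea in the abstract can be illustrated with a small, single-machine sketch. This is not SparkGIS code and does not use Spark: it assumes a simple fixed grid in place of the dynamic partitioning algorithms the abstract mentions, treats every spatial object as an axis-aligned bounding box, and uses hypothetical names (`Box`, `partition`, `spatial_join`) chosen only for illustration. It shows how assigning objects to partitions lets a spatial join compare only objects that share a partition, rather than every possible pair.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Set, Tuple

# A bounding box is (min_x, min_y, max_x, max_y). All names are hypothetical.
Box = Tuple[float, float, float, float]
Cell = Tuple[int, int]


def boxes_intersect(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells_for(box: Box, cell_size: float) -> Iterable[Cell]:
    """Return every fixed-grid cell that the box overlaps."""
    x0, y0 = int(box[0] // cell_size), int(box[1] // cell_size)
    x1, y1 = int(box[2] // cell_size), int(box[3] // cell_size)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))


def partition(boxes: List[Box], cell_size: float) -> Dict[Cell, List[int]]:
    """Map each grid cell to the ids of the objects overlapping it.

    An object that spans several cells is replicated into each of them.
    """
    cells: Dict[Cell, List[int]] = defaultdict(list)
    for obj_id, box in enumerate(boxes):
        for cell in cells_for(box, cell_size):
            cells[cell].append(obj_id)
    return cells


def spatial_join(
    left: List[Box], right: List[Box], cell_size: float
) -> Set[Tuple[int, int]]:
    """Return (left_id, right_id) pairs whose bounding boxes intersect."""
    left_cells = partition(left, cell_size)
    right_cells = partition(right, cell_size)
    pairs: Set[Tuple[int, int]] = set()
    # Pruning: only cells populated on both sides are examined at all.
    for cell in left_cells.keys() & right_cells.keys():
        for i in left_cells[cell]:
            for j in right_cells[cell]:
                if boxes_intersect(left[i], right[j]):
                    # The set drops duplicates caused by replication.
                    pairs.add((i, j))
    return pairs
```

For example, `spatial_join([(0, 0, 2, 2), (10, 10, 12, 12)], [(1, 1, 3, 3), (20, 20, 21, 21)], 5.0)` examines a single shared cell and returns `{(0, 0)}`. A fixed grid like this handles skewed data poorly, since dense regions produce overloaded cells; that limitation is the kind of problem the skew-mitigating partitioning described in the abstract is meant to address.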

