A Generic Machine Learning Model for Spatial Query Optimization based on Spatial Embeddings
Authors: A. Belussi, S. Migliorini, Ahmed Eldawy
Journal: ACM Transactions on Spatial Algorithms and Systems
Publication date: 2024-04-13
DOI: 10.1145/3657633 (https://doi.org/10.1145/3657633)
Abstract
Machine learning (ML) and deep learning (DL) techniques are increasingly applied to produce efficient query optimizers, particularly for big data systems. Optimizing spatial operations, such as spatial join or range query, is even more challenging because of the inherent complexity of these operations and the peculiarities of spatial data. Although a few ML-based spatial query optimizers have been proposed in the literature, their design limits their use, since each one is tailored to a specific collection of datasets, a specific operation, or a specific hardware setting. A change to any of these requires building and training a completely new model, which in turn entails collecting a new, very large training dataset.
This paper proposes a different approach, which exploits the novel notion of spatial embedding to overcome these limitations. In particular, a preliminary model is defined that captures the relevant features of spatial datasets in an unsupervised manner, independently of the operation to be optimized. This model is trained on a large amount of both synthetic and real-world data, with the aim of producing meaningful spatial embeddings. The construction of the embedding model can be regarded as a preliminary step for the optimization of many different spatial operations, so its cost is compensated during the subsequent construction of the operation-specific models. Indeed, for each spatial operation considered, a tailored model is trained using spatial embeddings as input, so only a small number of training data points is required. Three specific operations are considered as a proof of concept in this paper: range query, self-join, and binary spatial join. Finally, a comparison with an alternative technique, transfer learning, is provided, and the advantages of the proposed technique over it are highlighted.
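The two-stage idea in the abstract (an operation-independent embedding learned without labels, followed by a small operation-specific model trained on few labeled examples) can be illustrated with a minimal sketch. It is not the authors' implementation: the grid histogram as dataset summary, PCA as a stand-in for the unsupervised embedding model, the linear regressor, the fixed query box, and all function names are assumptions chosen only to keep the example short and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_dataset(n=500):
    """Hypothetical generator: one Gaussian cluster clipped to the unit square."""
    center = rng.uniform(0.2, 0.8, size=2)
    spread = rng.uniform(0.05, 0.3)
    return np.clip(rng.normal(center, spread, size=(n, 2)), 0.0, 1.0)

def grid_histogram(points, cells=8):
    """Summarise a point dataset as normalised counts on a cells x cells grid."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                bins=cells, range=[[0, 1], [0, 1]])
    return (hist / max(len(points), 1)).ravel()

def fit_embedding(histograms, dim=4):
    """Unsupervised stage (assumption: PCA via SVD replaces a learned model)."""
    mean = histograms.mean(axis=0)
    _, _, vt = np.linalg.svd(histograms - mean, full_matrices=False)
    return mean, vt[:dim]

def embed(histogram, mean, components):
    """Project one histogram onto the learned components."""
    return components @ (histogram - mean)

def selectivity(points, box=(0.25, 0.25, 0.75, 0.75)):
    """Label for one operation: fraction of points inside a fixed range query."""
    x0, y0, x1, y1 = box
    inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
              (points[:, 1] >= y0) & (points[:, 1] <= y1))
    return inside.mean()

def fit_linear(features, targets):
    """Operation-specific stage: least-squares linear model with a bias term."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def predict_linear(weights, features):
    design = np.hstack([features, np.ones((len(features), 1))])
    return design @ weights

# Stage 1: many unlabeled datasets train the embedding once.
unlabeled = np.array([grid_histogram(synthetic_dataset()) for _ in range(200)])
mean, components = fit_embedding(unlabeled, dim=4)

# Stage 2: only a few labeled datasets train the range-query model.
labeled = [synthetic_dataset() for _ in range(20)]
X = np.array([embed(grid_histogram(p), mean, components) for p in labeled])
y = np.array([selectivity(p) for p in labeled])
weights = fit_linear(X, y)
predictions = predict_linear(weights, X)
```

Under these assumptions, supporting another operation such as self-join would reuse `mean` and `components` unchanged and refit only the small second-stage model on labels for that operation.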
About the journal:
ACM Transactions on Spatial Algorithms and Systems (TSAS) is a scholarly journal that publishes the highest quality papers on all aspects of spatial algorithms and systems and closely related disciplines. It has a multi-disciplinary perspective in that it spans a large number of areas where spatial data is manipulated or visualized (regardless of how it is specified, i.e., geometrically or textually), such as geography, geographic information systems (GIS), geospatial and spatiotemporal databases, spatial and metric indexing, location-based services, web-based spatial applications, geographic information retrieval (GIR), spatial reasoning and mining, security and privacy, as well as the related visual computing areas of computer graphics, computer vision, geometric modeling, and visualization where spatial, geospatial, and spatiotemporal data are central.