A Generic Machine Learning Model for Spatial Query Optimization based on Spatial Embeddings

IF 17.7 · Zone 1 (Chemistry) · JCR Q1, Chemistry, Multidisciplinary

Accounts of Chemical Research · Pub Date: 2024-04-13 · DOI: 10.1145/3657633

A. Belussi, S. Migliorini, Ahmed Eldawy



Abstract


A Generic Machine Learning Model for Spatial Query Optimization based on Spatial Embeddings

Machine learning (ML) and deep learning (DL) techniques are increasingly applied to produce efficient query optimizers, particularly for big data systems. Optimizing spatial operations is even more challenging because of the inherent complexity of operations such as spatial join and range query, and because of the peculiarities of spatial data. Although a few ML-based spatial query optimizers have been proposed in the literature, their design limits their use, since each one is tailored to a specific collection of datasets, a specific operation, or a specific hardware setting. A change to any of these requires building and training a completely new model, which in turn entails collecting a new, very large training dataset. This paper proposes a different approach, which exploits the novel notion of spatial embedding to overcome these limitations. In particular, a preliminary model is defined that captures the relevant features of spatial datasets in an unsupervised manner, independently of the operation to be optimized. This model is trained on a large amount of both synthetic and real-world data, with the aim of producing meaningful spatial embeddings. The construction of the embedding model can be regarded as a preliminary step for the optimization of many different spatial operations, so its cost is amortized over the subsequent construction of the operation-specific models. Indeed, for each spatial operation considered, a tailored model is still trained, but because it takes spatial embeddings as input, only a small number of training data points is required. Three operations are considered as a proof of concept in this paper: range query, self-join, and binary spatial join. Finally, the proposed technique is compared with an alternative technique, transfer learning, and its advantages over that alternative are highlighted.
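The two-stage idea in the abstract (an operation-independent embedding first, then a small operation-specific model trained on embeddings) can be illustrated with a toy sketch. This is not the authors' model: a fixed grid histogram stands in for the learned spatial embedding, a nearest-neighbor lookup stands in for the operation-specific model, and every name below (`grid_histogram`, `range_selectivity`, `predict_nearest`, `clustered`, `demo`) is hypothetical. Points are assumed to lie in the unit square.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def grid_histogram(points: Sequence[Point], cells: int = 4) -> List[float]:
    """Stand-in 'embedding': normalized point counts on a cells x cells grid.

    Assumption: the paper learns its embedding; this fixed histogram
    only plays the same role of a fixed-length dataset descriptor.
    """
    counts = [0.0] * (cells * cells)
    for x, y in points:
        col = min(int(x * cells), cells - 1)  # clamp x == 1.0 into the last column
        row = min(int(y * cells), cells - 1)
        counts[row * cells + col] += 1.0
    total = max(len(points), 1)
    return [c / total for c in counts]


def range_selectivity(points: Sequence[Point], box: Box) -> float:
    """Ground-truth label: fraction of points that fall inside the query box."""
    xmin, ymin, xmax, ymax = box
    hits = sum(1 for x, y in points if xmin <= x <= xmax and ymin <= y <= ymax)
    return hits / max(len(points), 1)


def predict_nearest(train: Sequence[Tuple[List[float], float]],
                    embedding: List[float]) -> float:
    """Stand-in operation-specific model: return the label of the closest embedding."""
    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    _, label = min(train, key=lambda pair: sq_dist(pair[0], embedding))
    return label


def clustered(rng: random.Random, n: int, cx: float, cy: float,
              spread: float = 0.08) -> List[Point]:
    """Synthetic dataset: one Gaussian cluster clamped to the unit square."""
    def clamp(v: float) -> float:
        return min(max(v, 0.0), 1.0)
    return [(clamp(rng.gauss(cx, spread)), clamp(rng.gauss(cy, spread)))
            for _ in range(n)]


def demo() -> Tuple[float, float]:
    """Train on four datasets only, then estimate selectivity for an unseen one."""
    rng = random.Random(0)
    box = (0.0, 0.0, 0.5, 0.5)
    train = []
    for cx, cy in [(0.2, 0.2), (0.8, 0.8), (0.2, 0.8), (0.5, 0.5)]:
        data = clustered(rng, 500, cx, cy)
        train.append((grid_histogram(data), range_selectivity(data, box)))
    unseen = clustered(rng, 500, 0.25, 0.25)
    predicted = predict_nearest(train, grid_histogram(unseen))
    return predicted, range_selectivity(unseen, box)


if __name__ == "__main__":
    predicted, actual = demo()
    print(f"predicted selectivity: {predicted:.3f}, actual: {actual:.3f}")
```

The point of the sketch is the division of labor: `grid_histogram` never sees the query, so the same descriptor could feed a separate small model for self-join or binary spatial join, and only the labeled training pairs would need to change.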


Source journal

Accounts of Chemical Research (Chemistry, Multidisciplinary)

CiteScore: 31.40

Self-citation rate: 1.10%

Articles published: 312

Review time: 2 months

Journal description: Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author's own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance. Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research, affording the reader a closer look at the content and significance of an article. By providing a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure of the research.