基于gpu的SQL数据库内机器学习

33rd International Conference on Scientific and Statistical Database Management Pub Date : 2021-07-06 DOI:10.1145/3468791.3468840

Maximilian E. Schüle, Harald Lang, M. Springer, A. Kemper, Thomas Neumann, Stephan Günnemann

{"title":"基于gpu的SQL数据库内机器学习","authors":"Maximilian E. Schüle, Harald Lang, M. Springer, A. Kemper, Thomas Neumann, Stephan Günnemann","doi":"10.1145/3468791.3468840","DOIUrl":null,"url":null,"abstract":"In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline out of data preprocessing, model training and its validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: With the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.","PeriodicalId":312773,"journal":{"name":"33rd International Conference on Scientific and Statistical Database Management","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":"{\"title\":\"In-Database Machine Learning with SQL on GPUs\",\"authors\":\"Maximilian E. Schüle, Harald Lang, M. Springer, A. Kemper, Thomas Neumann, Stephan Günnemann\",\"doi\":\"10.1145/3468791.3468840\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline out of data preprocessing, model training and its validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: With the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.\",\"PeriodicalId\":312773,\"journal\":{\"name\":\"33rd International Conference on Scientific and Statistical Database Management\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"18\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"33rd International Conference on Scientific and Statistical Database Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3468791.3468840\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"33rd International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3468791.3468840","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 18

摘要

在机器学习中，不断重新训练模型可以保证基于最新数据作为训练输入的准确预测。但是，要从数据库中检索最新的数据，由于数据库系统很少用于矩阵代数和梯度下降等操作，因此需要进行耗时的提取。在这项工作中，我们证明了使用递归表的SQL可以从数据预处理、模型训练和验证中表达完整的机器学习管道。为了便于对损失函数进行说明，我们通过一个自动微分算子扩展了代码生成数据库系统Umbra，以便在递归表中使用:将损失函数用SQL表示为lambda函数，Umbra为每个偏导数生成机器码。我们进一步对专用梯度下降算子使用自动微分，该算子生成LLVM代码以在gpu上训练用户指定的模型。我们在硬件级别微调GPU内核，以允许更高的吞吐量，并提出多个单元的非阻塞同步。在我们的评估中，与单独编译每个导数相比，自动微分通过缓存子表达式的数量加速了运行时。我们的GPU内核具有独立的模型，即使在小批处理规模下也能实现最大的吞吐量，这使得SQL中的机器学习管道更具竞争力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

In-Database Machine Learning with SQL on GPUs

In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline out of data preprocessing, model training and its validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: With the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

33rd International Conference on Scientific and Statistical Database Management

自引率

0.00%

发文量