JoinBoost: Grow Trees Over Normalized Data Using Only SQL

Proc. VLDB Endow. Pub Date : 2023-07-01 DOI:10.48550/arXiv.2307.00422

Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu

{"title":"JoinBoost: Grow Trees Over Normalized Data Using Only SQL","authors":"Zezhou Huang, Rathijit Sen, Jiaxiang Liu, Eugene Wu","doi":"10.48550/arXiv.2307.00422","DOIUrl":null,"url":null,"abstract":"Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?\n \n We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the\n Y\n variable to the residual in the\n non-materialized join result.\n Although this view update problem is generally ambiguous, we identify\n addition-to-multiplication preserving\n , the key property of variance semi-ring to support\n rmse\n the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).\n","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"34 1","pages":"3071-3084"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2307.00422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

Abstract

Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving , the key property of variance semi-ring to support rmse the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).

查看原文本刊更多论文

JoinBoost:仅使用SQL在规范化数据上生长树

虽然在表格数据中占主导地位，但在规范化数据库(例如LightGBM, XGBoost)上训练树模型的ML库要求将数据反规范化为单个表，具体化并导出。此过程不可扩展，速度慢，并且存在安全风险。In-DB ML旨在训练dbms中的模型，以避免数据移动并提供数据治理。与其修改DBMS来支持In-DB ML，还不如为专门的ML库提供有竞争力的树训练性能……只使用SQL?我们介绍了JoinBoost，一个Python库，它将规范化数据库上的树训练算法重写为纯SQL。它可移植到任何DBMS，提供与专用ML库竞争的性能，并随DBMS的底层功能进行扩展。JoinBoost从算法和系统的角度扩展了先前的工作。在算法上，我们通过将Y变量更新为非物化连接结果中的残差来支持因式梯度提升。虽然这种视图更新问题通常是模棱两可的，但我们确定了方差半环支持rmse的关键属性——保持加法到乘法，这是最广泛使用的准则。在系统方面，我们认为残留更新是性能瓶颈。在列式dbms上，可以通过创建一个新的残值列并将其作为投影添加来最小化这种开销。我们用DuckDB上的两个实现验证了这一点，为了可移植性，对其内部没有或只有很少的修改。我们的实验表明，与LightGBM相比，JoinBoost在随机森林(梯度增强)方面的速度快了3倍(1.1倍)，比最先进的In-DB ML系统快了一个数量级。此外，JoinBoost在#特征、数据库大小(TPC-DS SF=1000)和连接图复杂性(星系模式)方面的扩展性远远超过LightGBM。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proc. VLDB Endow.

自引率

0.00%

发文量