在特征库中优化机器学习的数据管道

Proc. VLDB Endow. Pub Date : 2023-09-01 DOI:10.14778/3625054.3625060

Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez

{"title":"在特征库中优化机器学习的数据管道","authors":"Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez","doi":"10.14778/3625054.3625060","DOIUrl":null,"url":null,"abstract":"Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new \"DBMS-for-ML\" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":"24 1","pages":"4230-4239"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimizing Data Pipelines for Machine Learning in Feature Stores\",\"authors\":\"Rui Liu, Kwanghyun Park, Fotis Psallidas, Xiaoyong Zhu, Jinghui Mo, Rathijit Sen, Matteo Interlandi, Konstantinos Karanasos, Yuanyuan Tian, Jesús Camacho-Rodríguez\",\"doi\":\"10.14778/3625054.3625060\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new \\\"DBMS-for-ML\\\" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.\",\"PeriodicalId\":20467,\"journal\":{\"name\":\"Proc. VLDB Endow.\",\"volume\":\"24 1\",\"pages\":\"4230-4239\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proc. VLDB Endow.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14778/3625054.3625060\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proc. VLDB Endow.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14778/3625054.3625060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

数据管道（即将原始数据转换为特征）对于机器学习（ML）模型至关重要，但其开发和管理却非常耗时。最近，特征库作为一种新的 "DBMS-for-ML "出现了，其前提是让数据科学家和工程师能够定义和管理他们的数据管道。虽然从功能角度看，当前的特征库实现了它们的承诺，但它们却非常耗费资源--有大量机会实施数据库式的优化来提高它们的性能。在本文中，我们提出了一套新颖的优化方案，专门针对数据管道中的关键操作--时间点连接。我们在广泛使用的特征存储 Feathr 上实现了这些优化，并在 TPCx-AI 基准和真实世界在线零售场景的使用案例中对其进行了评估。全面的实验分析表明，与最先进的基线相比，我们的优化能将数据管道的速度提高 3 倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimizing Data Pipelines for Machine Learning in Feature Stores

Data pipelines (i.e., converting raw data to features) are critical for machine learning (ML) models, yet their development and management is time-consuming. Feature stores have recently emerged as a new "DBMS-for-ML" with the premise of enabling data scientists and engineers to define and manage their data pipelines. While current feature stores fulfill their promise from a functionality perspective, they are resource-hungry---with ample opportunities for implementing database-style optimizations to enhance their performance. In this paper, we propose a novel set of optimizations specifically targeted for point-in-time join, which is a critical operation in data pipelines. We implement these optimizations on top of Feathr: a widely-used feature store, and evaluate them on use cases from both the TPCx-AI benchmark and real-world online retail scenarios. Our thorough experimental analysis shows that our optimizations can accelerate data pipelines by up to 3× over state-of-the-art baselines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proc. VLDB Endow.

自引率

0.00%

发文量