Speculative Approximations for Terascale Distributed Gradient Descent Optimization

Chengjie Qin, Florin Rusu
{"title":"Speculative Approximations for Terascale Distributed Gradient Descent Optimization","authors":"Chengjie Qin, Florin Rusu","doi":"10.1145/2799562.2799563","DOIUrl":null,"url":null,"abstract":"Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascalesize synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th fraction of the time.","PeriodicalId":106601,"journal":{"name":"Proceedings of the Fourth Workshop on Data analytics in the Cloud","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fourth Workshop on Data analytics in the Cloud","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2799562.2799563","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the inability to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation identifies sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions that stop the execution accurately and in a timely manner. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machine and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as 1/20th of the time.
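
To make the two techniques concrete, below is a minimal single-machine sketch of how speculative parameter testing and online aggregation could interact during incremental gradient descent for logistic regression. It is not the GLADE PF-OLA implementation: the step-size grid, the chunk size, the z = 2 confidence bound, and all names are illustrative assumptions, and the NumPy loop stands in for the distributed data pass described in the paper.

```python
# Sketch (assumed, not the paper's code): speculative parameter testing shares
# one pass over the data across several step-size configurations, while online
# aggregation estimates each configuration's loss from the sample seen so far
# and halts configurations that are confidently sub-optimal.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic-regression data standing in for a terascale training set.
n, d = 20_000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float) * 2 - 1

def logistic_loss(w, Xc, yc):
    """Per-example logistic loss on a data chunk."""
    return np.log1p(np.exp(-yc * (Xc @ w)))

def logistic_grad(w, Xc, yc):
    """Average logistic-loss gradient on a data chunk."""
    s = -yc / (1.0 + np.exp(yc * (Xc @ w)))
    return (Xc.T @ s) / len(yc)

# Speculative parameter testing: several step sizes evaluated concurrently,
# each maintaining its own model but reusing the same scan over the data.
step_sizes = [0.5, 0.1, 0.01, 0.001]           # hypothetical configurations
models = {eta: np.zeros(d) for eta in step_sizes}
losses = {eta: [] for eta in step_sizes}       # per-example losses seen so far
alive = set(step_sizes)

chunk = 500
for start in range(0, n, chunk):
    Xc, yc = X[start:start + chunk], y[start:start + chunk]
    for eta in list(alive):
        # One incremental gradient step per configuration on the shared chunk.
        models[eta] -= eta * logistic_grad(models[eta], Xc, yc)
        losses[eta].extend(logistic_loss(models[eta], Xc, yc))

    # Online aggregation: estimate each configuration's objective from the
    # sample processed so far, with a crude +/- 2 * stderr confidence bound.
    bounds = {}
    for eta in alive:
        samp = np.asarray(losses[eta])
        half = 2.0 * samp.std() / np.sqrt(len(samp))
        bounds[eta] = (samp.mean() - half, samp.mean() + half)

    # Halting condition: drop a configuration once its optimistic (lower)
    # bound exceeds the best configuration's pessimistic (upper) bound.
    best_upper = min(ub for _, ub in bounds.values())
    for eta in list(alive):
        if len(alive) > 1 and bounds[eta][0] > best_upper:
            alive.discard(eta)
            print(f"dropped eta={eta} after {start + chunk} examples")

print("surviving configurations:", sorted(alive))
```

The design point mirrors the abstract: a single pass over the data serves every surviving configuration, and the halting condition compares conservative interval bounds rather than raw estimates, so a configuration is abandoned only once the estimator is confident it is sub-optimal.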