{"title":"万亿级分布式梯度下降优化的推测近似","authors":"Chengjie Qin, Florin Rusu","doi":"10.1145/2799562.2799563","DOIUrl":null,"url":null,"abstract":"Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascalesize synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th fraction of the time.","PeriodicalId":106601,"journal":{"name":"Proceedings of the Fourth Workshop on Data analytics in the Cloud","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Speculative Approximations for Terascale Distributed Gradient Descent Optimization\",\"authors\":\"Chengjie Qin, Florin Rusu\",\"doi\":\"10.1145/2799562.2799563\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. 
We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascalesize synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th fraction of the time.\",\"PeriodicalId\":106601,\"journal\":{\"name\":\"Proceedings of the Fourth Workshop on Data analytics in the Cloud\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Fourth Workshop on Data analytics in the Cloud\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2799562.2799563\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Fourth Workshop on Data analytics in the Cloud","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2799562.2799563","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Speculative Approximations for Terascale Distributed Gradient Descent Optimization
Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the inability to evaluate multiple parameter configurations simultaneously and the lack of support for quickly identifying sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. Online aggregation identifies sub-optimal configurations early in the computation by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions that stop the execution accurately and in a timely manner. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machine and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as 1/20 of the time.
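The paper targets the distributed GLADE PF-OLA runtime; the sketch below is a minimal single-machine Python illustration of the two abstract ideas, not the authors' implementation. It combines speculative parameter testing (every active step-size configuration of logistic regression advances on the same sampled chunk, so all configurations share a single pass over the data) with an online-aggregation-style halting condition (a configuration is pruned once the confidence interval around its estimated loss lies entirely above the best configuration's interval). All function names, the z-score threshold, and the pruning rule are illustrative assumptions rather than the estimators defined in the paper, and the loss estimate uses the evolving weights within an epoch, so it is only an approximation of a fixed-model loss.

```python
import numpy as np

def logistic_loss_terms(w, X, y):
    """Per-example logistic loss for labels y in {-1, +1} (numerically stable)."""
    return np.logaddexp(0.0, -y * (X @ w))

def logistic_gradient(w, X, y):
    """Gradient of the average logistic loss over one chunk."""
    margins = np.clip(y * (X @ w), -30.0, 30.0)  # clip to avoid overflow in exp
    coeff = -y / (1.0 + np.exp(margins))
    return (X.T @ coeff) / len(y)

def speculative_gd(X, y, step_sizes, epochs=5, chunk=1024, z=2.0):
    """Advance several step-size configurations per data pass, pruning those
    whose estimated loss is sub-optimal with high confidence."""
    n, d = X.shape
    weights = {s: np.zeros(d) for s in step_sizes}
    active = set(step_sizes)
    stats = {}
    for _ in range(epochs):
        order = np.random.permutation(n)
        # Running sum, sum of squares, and count of per-example losses
        # per configuration -- the online aggregation state.
        stats = {s: [0.0, 0.0, 0] for s in active}
        for lo in range(0, n, chunk):
            idx = order[lo:lo + chunk]
            Xb, yb = X[idx], y[idx]
            for s in active:
                # Speculative step: all active configurations advance on the
                # same chunk, sharing one scan of the training data.
                weights[s] -= s * logistic_gradient(weights[s], Xb, yb)
                terms = logistic_loss_terms(weights[s], Xb, yb)
                st = stats[s]
                st[0] += terms.sum()
                st[1] += (terms ** 2).sum()
                st[2] += len(terms)
            # Confidence interval on each configuration's estimated loss.
            bounds = {}
            for s in active:
                total, sq, m = stats[s]
                mean = total / m
                se = np.sqrt(max(sq / m - mean ** 2, 0.0) / m)
                bounds[s] = (mean - z * se, mean + z * se)
            best_upper = min(ub for (_, ub) in bounds.values())
            # Halting condition: drop configurations whose loss lower bound
            # already exceeds the best configuration's upper bound.
            active = {s for s in active if bounds[s][0] <= best_upper}
    best = min(active, key=lambda s: stats[s][0] / stats[s][2])
    return best, weights[best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    y = np.sign(X @ rng.normal(size=10) + 0.1 * rng.normal(size=5000))
    best_step, w = speculative_gd(X, y, step_sizes=(1.0, 0.1, 0.01, 0.001))
    print("best step size:", best_step)
```

The structure suggests why the abstract's headline numbers are plausible: with many candidate configurations the per-chunk gradient arithmetic grows, but the scan over the training data is amortized across all of them, so when I/O dominates, 32 configurations cost little more than one; and because the loss estimates tighten as more chunks are sampled, clearly inferior configurations can be discarded long before a full pass completes.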