KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial

Jun Zhou, Xiaolong Li, P. Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, Yuan Qi
{"title":"KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial","authors":"Jun Zhou, Xiaolong Li, P. Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, Yuan Qi","doi":"10.1145/3097983.3098029","DOIUrl":null,"url":null,"abstract":"In recent years, due to the emergence of Big Data (terabytes or petabytes) and Big Model (tens of billions of parameters), there has been an ever-increasing need of parallelizing machine learning (ML) algorithms in both academia and industry. Although there are some existing distributed computing systems, such as Hadoop and Spark, for parallelizing ML algorithms, they only provide synchronous and coarse-grained operators (e.g., Map, Reduce, and Join, etc.), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform termed KunPeng, that combines both distributed systems and parallel optimization algorithms to deal with the complexities that arise from large-scale ML. Specifically, KunPeng not only encapsulates the characteristics of data/model parallelism, load balancing, model sync-up, sparse representation, industrial fault-tolerance, etc., but also provides easy-to-use interface to empower users to focus on the core ML logics. Empirical results on terabytes of real datasets with billions of samples and features demonstrate that, such a design brings compelling performance improvements on ML programs ranging from Follow-the-Regularized-Leader Proximal algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng's encouraging performance is also shown for several real-world applications including the Alibaba's Double 11 Online Shopping Festival and Ant Financial's transaction risk estimation.","PeriodicalId":314049,"journal":{"name":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3097983.3098029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 44

Abstract

In recent years, due to the emergence of Big Data (terabytes or petabytes) and Big Model (tens of billions of parameters), there has been an ever-increasing need for parallelizing machine learning (ML) algorithms in both academia and industry. Although some existing distributed computing systems, such as Hadoop and Spark, can parallelize ML algorithms, they provide only synchronous and coarse-grained operators (e.g., Map, Reduce, and Join), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform, termed KunPeng, that combines distributed systems and parallel optimization algorithms to deal with the complexities that arise from large-scale ML. Specifically, KunPeng not only encapsulates data/model parallelism, load balancing, model sync-up, sparse representation, industrial fault tolerance, and related features, but also provides an easy-to-use interface that lets users focus on the core ML logic. Empirical results on terabytes of real datasets with billions of samples and features demonstrate that such a design brings compelling performance improvements to ML programs ranging from the Follow-the-Regularized-Leader Proximal (FTRL-Proximal) algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng's encouraging performance is also demonstrated in several real-world applications, including Alibaba's Double 11 Online Shopping Festival and Ant Financial's transaction risk estimation.
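The abstract names the FTRL-Proximal algorithm and Sparse Logistic Regression among the workloads KunPeng accelerates. As a point of reference only, the sketch below shows the standard per-coordinate FTRL-Proximal update for sparse logistic regression on a single machine (following McMahan et al., 2013); it is not KunPeng's code, and the class name and hyperparameter defaults are illustrative assumptions. In the parameter-server setting the paper describes, the per-coordinate state (z, n) would live on server nodes, with workers pulling the coordinates touched by their data shard, computing gradients locally, and pushing updates back.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Minimal single-machine sketch of per-coordinate FTRL-Proximal for
    sparse logistic regression. Not KunPeng's implementation; state here
    is local, whereas a parameter server would shard z/n across servers."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)   # accumulated z statistic per coordinate
        self.n = defaultdict(float)   # accumulated squared gradients per coordinate

    def _weight(self, i):
        """Lazily recover w_i from (z_i, n_i); the L1 threshold keeps most weights exactly zero."""
        z, n = self.z[i], self.n[i]
        if abs(z) <= self.l1:
            return 0.0
        return -(z - math.copysign(self.l1, z)) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        """x is a sparse example: {feature_index: value}. Returns P(y=1)."""
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        """One FTRL-Proximal step on a single example (y in {0, 1})."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                                   # logistic-loss gradient
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# Toy usage on two sparse examples keyed by feature index.
model = FTRLProximal()
for x, y in [({0: 1.0, 3: 0.5}, 1), ({1: 1.0, 3: 0.2}, 0)]:
    model.update(x, y)
print(model.predict({0: 1.0, 3: 0.5}))
```

Because weights are recovered lazily from (z, n) and thresholded by the L1 term, most coordinates remain exactly zero, which is one reason a sparse representation of the kind the abstract mentions is effective at this scale.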