A Cloud System for Machine Learning Exploiting a Parallel Array DBMS

2017 28th International Workshop on Database and Expert Systems Applications (DEXA). Pub Date: 2017-08-01. DOI: 10.1109/DEXA.2017.21

Yiqun Zhang, C. Ordonez, S. Johnsson


Citations: 3

Abstract

Computing machine learning models in the cloud remains a central problem in big data analytics. In this work, we introduce a cloud analytic system exploiting a parallel array DBMS based on a classical shared-nothing architecture. Our approach combines in-DBMS data summarization with mathematical processing in an external program. We study how to summarize a data set in parallel assuming a large number of processing nodes and how to further accelerate it with GPUs. In contrast to most big data analytic systems, we do not use Java, HDFS, MapReduce or Spark: our system is programmed in C++ and C on top of a traditional Unix file system. In our system, models are efficiently computed using a suite of innovative parallel matrix operators, which compute comprehensive statistical summaries of a large input data set (matrix) in one pass, leaving the remaining mathematically complex computations, with matrices that fit in RAM, to R. In order to be competitive with the Hadoop ecosystem (i.e., HDFS and Spark RDDs), we also introduce a parallel load operator for large matrices and an automated, yet flexible, cluster configuration in the cloud. Experiments compare our system with Spark, showing orders of magnitude time improvement. A GPU with many cores widens the gap further. In summary, our system is a competitive solution.
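The split described in the abstract, a one-pass parallel summarization followed by in-memory math on small matrices, can be illustrated with a minimal sketch. The abstract does not say which statistics the operators compute, so the summary used here (row count n, linear sum L, and quadratic sum Q of outer products) is an assumption, and the function names `summarize`, `merge`, and `mean_and_covariance` are hypothetical. Plain Python stands in for both the C++/C operators and the R step.

```python
from typing import List, Tuple

# Assumed summary layout: (row count n, linear sum L, quadratic sum Q).
Summary = Tuple[int, List[float], List[List[float]]]


def summarize(rows: List[List[float]], d: int) -> Summary:
    """One pass over a single partition of d-dimensional rows."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]  # accumulate the outer product x x^T
    return n, L, Q


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partition summaries by element-wise addition."""
    d = len(a[1])
    n = a[0] + b[0]
    L = [a[1][i] + b[1][i] for i in range(d)]
    Q = [[a[2][i][j] + b[2][i][j] for j in range(d)] for i in range(d)]
    return n, L, Q


def mean_and_covariance(s: Summary) -> Tuple[List[float], List[List[float]]]:
    """Small in-memory step: population mean and covariance from the summary."""
    n, L, Q = s
    d = len(L)
    mean = [v / n for v in L]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov
```

Because the summaries are plain sums, each node in a shared-nothing cluster could summarize its own partition independently, and merging the partial results gives the same summary as a single pass over the whole data set. The final summary has size d × d regardless of the number of rows, which is why the remaining computation can run on matrices that fit in RAM.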



