A Cloud System for Machine Learning Exploiting a Parallel Array DBMS

Yiqun Zhang, C. Ordonez, S. Johnsson
Published in: 2017 28th International Workshop on Database and Expert Systems Applications (DEXA), 2017-08-01
DOI: 10.1109/DEXA.2017.21 (https://doi.org/10.1109/DEXA.2017.21)
Citations: 3

Abstract

Computing machine learning models in the cloud remains a central problem in big data analytics. In this work, we introduce a cloud analytic system exploiting a parallel array DBMS based on a classical shared-nothing architecture. Our approach combines in-DBMS data summarization with mathematical processing in an external program. We study how to summarize a data set in parallel assuming a large number of processing nodes and how to further accelerate it with GPUs. In contrast to most big data analytic systems, we do not use Java, HDFS, MapReduce or Spark: our system is programmed in C++ and C on top of a traditional Unix file system. In our system, models are efficiently computed using a suite of innovative parallel matrix operators, which compute comprehensive statistical summaries of a large input data set (matrix) in one pass, leaving the remaining mathematically complex computations, with matrices that fit in RAM, to R. In order to be competitive with the Hadoop ecosystem (i.e., HDFS and Spark RDDs) we also introduce a parallel load operator for large matrices and an automated, yet flexible, cluster configuration in the cloud. Experiments compare our system with Spark, showing orders of magnitude time improvement. A GPU with many cores widens the gap further. In summary, our system is a competitive solution.
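
The abstract does not say which statistics the matrix operators compute, so the following is only a minimal sketch of the general idea of one-pass summarization followed by small in-memory math. It assumes, purely for illustration, that each node computes the row count n, the linear sum L, and the quadratic sum Q of its partition, and that partial summaries are merged by addition. The function names `summarize`, `merge`, and `mean_and_covariance` are hypothetical, and the sketch uses Python for brevity rather than the C++, C, and R stack the abstract describes.

```python
from functools import reduce


def summarize(rows, d):
    """One pass over a partition of d-column rows: return (n, L, Q).

    n is the row count, L the per-column sum, Q the sum of outer products.
    These particular statistics are an assumption, not taken from the paper.
    """
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in rows:
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]
    return n, L, Q


def merge(a, b):
    """Combine two partial summaries; addition makes the order irrelevant."""
    n_a, L_a, Q_a = a
    n_b, L_b, Q_b = b
    d = len(L_a)
    return (
        n_a + n_b,
        [L_a[i] + L_b[i] for i in range(d)],
        [[Q_a[i][j] + Q_b[i][j] for j in range(d)] for i in range(d)],
    )


def mean_and_covariance(summary):
    """Derive the mean and population covariance from the small summary alone."""
    n, L, Q = summary
    d = len(L)
    mean = [L[i] / n for i in range(d)]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov


# Two partitions stand in for two shared-nothing nodes (toy data).
partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]
total = reduce(merge, (summarize(p, 2) for p in partitions))
mean, cov = mean_and_covariance(total)
```

Under these assumptions the large matrix is read once, each node touches only its own partition, and the merged summary has size d by d regardless of the number of rows, which is what lets the later step run on matrices that fit in RAM.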