Distributed MCMC Inference in Dirichlet Process Mixture Models Using Julia

Or Dinari, A. Yu, O. Freifeld, John W. Fisher III
{"title":"Distributed MCMC Inference in Dirichlet Process Mixture Models Using Julia","authors":"Or Dinari, A. Yu, O. Freifeld, John W. Fisher III","doi":"10.1109/CCGRID.2019.00066","DOIUrl":null,"url":null,"abstract":"Due to the increasing availability of large data sets, the need for general-purpose massively-parallel analysis tools become ever greater. In unsupervised learning, Bayesian nonparametric mixture models, exemplified by the Dirichlet-Process Mixture Model (DPMM), provide a principled Bayesian approach to adapt model complexity to the data. Despite their potential, however, DPMMs have yet to become a popular tool. This is partly due to the lack of friendly software tools that can handle large datasets efficiently. Here we show how, using Julia, one can achieve efficient and easily-modifiable implementation of distributed inference in DPMMs. Particularly, we show how a recent parallel MCMC inference algorithm - originally implemented in C++ for a single multi-core machine - can be distributed efficiently across multiple multi-core machines using a distributed-memory model. This leads to speedups, alleviates memory and storage limitations, and lets us learn DPMMs from significantly larger datasets and of higher dimensionality. It also turned out that even on a single machine the proposed Julia implementation handles higher dimensions more gracefully (at least for Gaussians) than the original C++ implementation. Finally, we use the proposed implementation to learn a model of image patches and apply the learned model for image denoising. While we speculate that a highly-optimized distributed implementation in, say, C++ could have been faster than the proposed implementation in Julia, from our perspective as machine-learning researchers (as opposed to HPC researchers), the latter also offers a practical and monetary value due to the ease of development and abstraction level. Our code is publicly available at https://github.com/dinarior/dpmm subclusters.jl","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14

Abstract

Due to the increasing availability of large data sets, the need for general-purpose massively-parallel analysis tools becomes ever greater. In unsupervised learning, Bayesian nonparametric mixture models, exemplified by the Dirichlet-Process Mixture Model (DPMM), provide a principled Bayesian approach to adapting model complexity to the data. Despite their potential, however, DPMMs have yet to become a popular tool. This is partly due to the lack of user-friendly software tools that can handle large datasets efficiently. Here we show how, using Julia, one can achieve an efficient and easily-modifiable implementation of distributed inference in DPMMs. In particular, we show how a recent parallel MCMC inference algorithm, originally implemented in C++ for a single multi-core machine, can be distributed efficiently across multiple multi-core machines using a distributed-memory model. This leads to speedups, alleviates memory and storage limitations, and lets us learn DPMMs from significantly larger and higher-dimensional datasets. It also turns out that even on a single machine the proposed Julia implementation handles higher dimensions more gracefully (at least for Gaussians) than the original C++ implementation. Finally, we use the proposed implementation to learn a model of image patches and apply the learned model to image denoising. While we speculate that a highly-optimized distributed implementation in, say, C++ could have been faster than the proposed Julia implementation, from our perspective as machine-learning researchers (as opposed to HPC researchers) the latter also offers practical and monetary value due to its ease of development and level of abstraction. Our code is publicly available at https://github.com/dinarior/dpmm_subclusters.jl
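
For readers unfamiliar with Julia's distributed-memory tooling, the sketch below illustrates the general map/reduce pattern that distributed inference of this kind relies on: each worker summarizes its local data shard into sufficient statistics, and the master merges them to update (here, Gaussian) parameters. This is a minimal illustration only, not the authors' DPMM sampler; the function names shard_stats and fit_gaussian_distributed, and the use of pmap over explicit data shards, are assumptions made for the example. A real deployment would keep each shard resident on its worker (e.g., via DistributedArrays.jl) rather than shipping it on every call.

# Minimal sketch (NOT the paper's implementation) of the distributed
# sufficient-statistics pattern: workers summarize their shards, the
# master merges the summaries and updates a Gaussian's parameters.
using Distributed
addprocs(4)                       # 4 local workers; on a cluster, add remote workers instead

@everywhere using LinearAlgebra

@everywhere function shard_stats(X::Matrix{Float64})
    # Sufficient statistics of one shard: count, sum, sum of outer products.
    n = size(X, 2)                # one data point per column
    s = vec(sum(X, dims=2))
    S = X * X'
    return (n, s, S)
end

function fit_gaussian_distributed(shards::Vector{Matrix{Float64}})
    # Map: each worker summarizes its shard; reduce: merge on the master.
    stats = pmap(shard_stats, shards)
    n = sum(t[1] for t in stats)
    s = sum(t[2] for t in stats)
    S = sum(t[3] for t in stats)
    μ = s / n
    Σ = S / n - μ * μ'            # biased (MLE) covariance; fine for a sketch
    return μ, Σ
end

# Toy usage: split a 2-D dataset into one shard per worker.
X = randn(2, 10_000)
shards = [X[:, i:nworkers():end] for i in 1:nworkers()]
μ, Σ = fit_gaussian_distributed(shards)

In the full DPMM sampler the per-worker step is the sampling of cluster (and sub-cluster) assignments rather than a single Gaussian fit, but the communication pattern, local sufficient statistics followed by a global parameter update, is the same, which is why the approach scales across multiple multi-core machines.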