Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2020-12-01 DOI:10.1109/HiPC50609.2020.00028

Lawton Manning, Grey Ballard, R. Kannan, Haesun Park

{"title":"Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization","authors":"Lawton Manning, Grey Ballard, R. Kannan, Haesun Park","doi":"10.1109/HiPC50609.2020.00028","DOIUrl":null,"url":null,"abstract":"Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnegative data, either for computing a flat partitioning of a dataset or for determining a hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical clustering that uses a divide-and-conquer approach based on rank-two NMF to split a data set into two cohesive parts. Not only does this approach uncover more structure in the data than a flat NMF clustering, but also rank-two NMF can be computed more quickly than for general ranks, providing comparable overall time to solution. Our data distribution and parallelization strategies are designed to maintain computational load balance throughout the data-dependent hierarchy of computation while limiting interprocess communication, allowing the algorithm to scale to large dense and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit supercomputer), applying the hierarchical clustering approach to hyperspectral imaging and image classification data. Our algorithm for Rank-2 NMF scales perfectly on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x speedup scaling from 10 to 80 nodes on the 800 GB dataset.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HiPC50609.2020.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

Abstract

Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnegative data, either for computing a flat partitioning of a dataset or for determining a hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical clustering that uses a divide-and-conquer approach based on rank-two NMF to split a data set into two cohesive parts. Not only does this approach uncover more structure in the data than a flat NMF clustering, but also rank-two NMF can be computed more quickly than for general ranks, providing comparable overall time to solution. Our data distribution and parallelization strategies are designed to maintain computational load balance throughout the data-dependent hierarchy of computation while limiting interprocess communication, allowing the algorithm to scale to large dense and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit supercomputer), applying the hierarchical clustering approach to hyperspectral imaging and image classification data. Our algorithm for Rank-2 NMF scales perfectly on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x speedup scaling from 10 to 80 nodes on the 800 GB dataset.

查看原文本刊更多论文

基于二阶非负矩阵分解的并行分层聚类

非负矩阵分解(NMF)是聚类非负数据的有效工具，用于计算数据集的平面划分或确定相似性层次。在本文中，我们提出了一种并行分层聚类算法，该算法使用基于二级NMF的分而治之方法将数据集分成两个内聚部分。与平面NMF聚类相比，这种方法不仅可以揭示数据中的更多结构，而且可以比一般排名更快地计算排名2的NMF，从而提供可比较的解决方案总时间。我们的数据分布和并行化策略旨在在整个依赖数据的计算层次中保持计算负载平衡，同时限制进程间通信，允许算法扩展到大型密集和稀疏数据集。我们在数据大小(高达800 GB)和处理器数量(高达Summit超级计算机的80个节点)方面展示了并行算法的可扩展性，并将分层聚类方法应用于高光谱成像和图像分类数据。我们的Rank-2 NMF算法在多达1000个核心上完美扩展，整个分层聚类方法在800 GB数据集上从10个节点扩展到80个节点，实现了5.9倍的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)

自引率

0.00%

发文量