SummaGen: Parallel Matrix-Matrix Multiplication Based on Non-rectangular Partitions for Heterogeneous HPC Platforms

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2019-05-01. DOI: 10.1109/IPDPSW.2019.00017

Stephen Patton, Hamidreza Khaleghzadeh, Ravi Reddy, Alexey L. Lastovetsky


Citations: 1

Abstract

Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high-performance computing (HPC) domain. The problem of finding the optimal shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second relaxed the rectangular partition constraint to allow non-rectangular partitions. The research in the second thread, however, is entirely theoretical. There is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose SummaGen, an implementation of PMM based on non-rectangular partitions. To study its efficacy, we compare the performance of PMM for four partition shapes proven optimal for the three-processor case, where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performance (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We demonstrate further that the four shapes exhibit equal dynamic energy consumption in this case. We also present a study of the performance of PMM for the same partition shapes, for a matrix decomposition obtained with a load-imbalancing data partitioning algorithm that employs functional performance models (FPMs). The peak and average performance of the implementation are 80% and 70%, respectively, of the theoretical peak floating-point performance of the machine.
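
As a rough illustration of the constant-speed case described in the abstract, the sketch below splits the elements of an n x n result matrix among processors in proportion to their speeds, given as positive real numbers. It is not SummaGen's code: the function name `proportional_areas` and the speed values are hypothetical, and the sketch computes only how many elements each processor owns, not the partition shapes studied in the paper.

```python
def proportional_areas(n, speeds):
    """Return the number of matrix elements assigned to each processor.

    Assumption (not from the paper): each processor has a constant speed,
    and its share of the n*n elements is proportional to that speed.
    """
    if n <= 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("n and all speeds must be positive")

    total = n * n
    speed_sum = sum(speeds)

    # Ideal (possibly fractional) share for each processor.
    ideal = [total * s / speed_sum for s in speeds]

    # Start from the rounded-down shares.
    areas = [int(x) for x in ideal]

    # Hand the leftover elements to the processors with the largest
    # fractional parts so the shares add up to exactly n*n.
    leftover = total - sum(areas)
    by_fraction = sorted(
        range(len(speeds)),
        key=lambda i: ideal[i] - areas[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        areas[i] += 1

    return areas


if __name__ == "__main__":
    # Hypothetical relative speeds for three devices.
    print(proportional_areas(4, [1.0, 2.0, 1.0]))
```

With these element counts fixed, the different partition shapes change only how each processor's elements are arranged within the matrix, which is the aspect the paper evaluates experimentally.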

