A multilevel compressed sparse row format for efficient sparse computations on multicore processors

2014 21st International Conference on High Performance Computing (HiPC). Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116882

H. Kabir, J. Booth, P. Raghavan


Citations: 10

Abstract

We seek to improve the performance of sparse matrix computations on multicore processors with non-uniform memory access (NUMA). Typical implementations use a bandwidth-reducing ordering of the matrix to increase locality of accesses, together with a compressed storage format to store and operate only on the non-zero values. We propose a new multilevel storage format and a companion ordering scheme as an explicit adaptation to map to NUMA hierarchies. More specifically, we propose CSR-k, a multilevel form of the popular compressed sparse row (CSR) format for a multicore processor with k > 1 well-differentiated levels in the memory subsystem. Additionally, we develop Band-k, a modified form of a traditional bandwidth reduction scheme, to convert a matrix represented in CSR to our proposed CSR-k. We evaluate the performance of the widely used and important sparse matrix-vector multiplication (SpMV) kernel using CSR-2 on Intel Westmere processors for a test suite of 12 large sparse matrices with row densities in the range 3 to 45. On 32 cores, on average across all matrices in the test suite, the execution time for SpMV with CSR-2 is less than 42% of the time taken by the state-of-the-art automatically tuned SpMV, resulting in energy savings of approximately 56%. Additionally, on average, the parallel speed-up on 32 cores of the automatically tuned SpMV relative to its 1-core performance is 8.18, compared to a value of 19.71 for CSR-2. Our analysis indicates that the higher performance of SpMV with CSR-2 comes from achieving higher reuse of x in the shared L3 cache without incurring overheads from fill-in of original zeroes. Furthermore, the pre-processing costs of SpMV with CSR-2 can be amortized on average over 97 iterations of SpMV using CSR and are substantially lower than the 513 iterations required for the automatically tuned implementation. Based on these results, CSR-k appears to be a promising multilevel formulation of CSR for adapting sparse computations to multicore processors with NUMA memory hierarchies.
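
To make the formats in the abstract concrete, the following is a minimal Python sketch of SpMV on plain CSR, followed by one plausible reading of a two-level variant. The abstract does not specify the CSR-2 layout, so the extra pointer array `sr_ptr` (grouping consecutive rows into "super-rows"), the function names, and the 4x4 test matrix are all assumptions made for illustration. This is not the authors' implementation, and it says nothing about how Band-k chooses the grouping.

```python
from typing import List, Sequence


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for A stored in standard CSR."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Non-zeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y


def spmv_csr2(sr_ptr: Sequence[int], row_ptr: Sequence[int],
              col_idx: Sequence[int], vals: Sequence[float],
              x: Sequence[float]) -> List[float]:
    """Same product, traversed by hypothetical super-rows.

    sr_ptr[s] .. sr_ptr[s+1]-1 are the rows of super-row s (an assumed
    layout). In a parallel code, each super-row could be the unit of work
    handed to one thread, so that rows touching nearby entries of x are
    processed together.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for s in range(len(sr_ptr) - 1):
        for i in range(sr_ptr[s], sr_ptr[s + 1]):
            acc = 0.0
            for p in range(row_ptr[i], row_ptr[i + 1]):
                acc += vals[p] * x[col_idx[p]]
            y[i] = acc
    return y


# Illustrative tridiagonal matrix:
# [[4, 1, 0, 0],
#  [1, 4, 1, 0],
#  [0, 1, 4, 1],
#  [0, 0, 1, 4]]
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
sr_ptr = [0, 2, 4]          # two super-rows of two rows each (assumed grouping)
x = [1.0, 2.0, 3.0, 4.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_csr2 = spmv_csr2(sr_ptr, row_ptr, col_idx, vals, x)
print(y_csr)   # [6.0, 12.0, 18.0, 19.0]
print(y_csr2)  # [6.0, 12.0, 18.0, 19.0]
```

Note that in this sketch the `row_ptr`, `col_idx`, and `vals` arrays are unchanged between the two versions; only an index over rows is added. That is consistent with the abstract's remark that CSR-2 avoids fill-in of original zeroes, in contrast to blocked formats that pad with explicit zeros.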

