Algorithm and hardware co-optimized solution for large SpMV problems

2017 IEEE High Performance Extreme Computing Conference (HPEC), Pub Date: 2017-09-01, DOI: 10.1109/HPEC.2017.8091096

Fazle Sadi, L. Pileggi, F. Franchetti

Citations: 9

Abstract

Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel for many scientific and engineering applications. However, SpMV performance and efficiency are poor on commercial off-the-shelf (COTS) architectures, especially when the data size exceeds the on-chip memory or last-level cache (LLC). In this work we present a hardware accelerator for large SpMV problems that is co-optimized with the algorithm. We start by exploring the basic differences in data transfer characteristics among various SpMV algorithms. We propose an algorithm that requires the least amount of data transfer while ensuring that all main memory accesses are streamed. However, the proposed algorithm requires an efficient multi-way merge, which is difficult to achieve on COTS architectures. Hence, we propose a hardware accelerator model that includes an Application Specific Integrated Circuit (ASIC) for the multi-way merge operation. The proposed accelerator incorporates state-of-the-art 3D-stacked High Bandwidth Memory (HBM) to demonstrate the proposed algorithm's capability when coupled with the latest technologies. Simulation results on standard benchmarks show improvements of over 100× in both energy efficiency and performance compared with COTS architectures running commercial libraries.
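The abstract does not spell out the proposed algorithm. As a rough illustration of why a multi-way merge becomes necessary, here is a minimal Python sketch of one generic merge-based SpMV formulation. It assumes a COO input and a partition of the matrix into column blocks, so that each block only reads a slice of x (a slice that could stay on chip) and emits a row-sorted stream of partial sums. Combining those streams then requires a k-way merge by row index. The function names `column_block_partials`, `merge_partials`, and `spmv_merge` and the `block_width` parameter are hypothetical, and this is not claimed to be the authors' algorithm.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def column_block_partials(rows, cols, vals, x, block_width):
    """Step 1 (illustrative): split the COO matrix into column blocks of
    width block_width. Each block only reads x[b*block_width:(b+1)*block_width]
    and produces one stream of (row, partial_sum) pairs sorted by row."""
    blocks = {}
    for r, c, v in zip(rows, cols, vals):
        # Accumulate this block's contribution to row r.
        partial = blocks.setdefault(c // block_width, {})
        partial[r] = partial.get(r, 0.0) + v * x[c]
    # Emit the blocks in column order; each stream is sorted by row index.
    return [sorted(partial.items()) for _, partial in sorted(blocks.items())]


def merge_partials(streams, n_rows):
    """Step 2 (illustrative): k-way merge the row-sorted streams and sum
    the partial results that share a row index."""
    y = [0.0] * n_rows
    merged = heapq.merge(*streams, key=itemgetter(0))
    # Equal row indices are adjacent after the merge, so groupby collects them.
    for r, group in groupby(merged, key=itemgetter(0)):
        y[r] = sum(p for _, p in group)
    return y


def spmv_merge(n_rows, rows, cols, vals, x, block_width):
    """Hypothetical two-phase SpMV: y = A @ x via column blocks + multi-way merge."""
    streams = column_block_partials(rows, cols, vals, x, block_width)
    return merge_partials(streams, n_rows)
```

For the 3×3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored as `rows=[0, 0, 1, 2, 2]`, `cols=[0, 2, 1, 0, 2]`, `vals=[1.0, 2.0, 3.0, 4.0, 5.0]`, calling `spmv_merge(3, rows, cols, vals, [1.0, 1.0, 1.0], 2)` returns `[3.0, 3.0, 9.0]`. In software, `heapq.merge` pays roughly O(log k) work per element for k streams. This cost hints at why a dedicated merge unit becomes attractive when the number of column blocks is large.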

Source journal

2017 IEEE High Performance Extreme Computing Conference (HPEC)
