A Portable Sparse Solver Framework for Large Matrices on Heterogeneous Architectures

F. Rabbi, C. Daley, Ümit V. Çatalyürek, H. Aktulga

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2022. DOI: 10.1109/HiPC56025.2022.00030
Abstract: Programming applications on heterogeneous systems with hardware accelerators is challenging due to the disjoint address spaces of the host (CPU) and the device (GPU). Limited device memory exacerbates the challenge, as the working sets of most data-intensive applications do not fit within it. CUDA Unified Memory (UM) was introduced to mitigate these challenges: it improves GPU programmability by supporting memory oversubscription, on-demand paging, and migration. However, when an application's working set exceeds device memory capacity, the resulting data movement can cause significant performance losses. We propose a tiling-based task-parallel framework, named DeepSparseGPU, that accelerates sparse eigensolvers on GPUs by minimizing data movement between the host and the device. To this end, we tile all operations in a sparse solver and express the entire computation as a directed acyclic graph (DAG). We also design and develop a memory manager (MM) to execute large inputs that do not fit into GPU memory: MM keeps track of the data on the CPU and GPU, and automatically moves data between them as needed. Our implementation uses OpenMP target offload to achieve portability beyond NVIDIA hardware. Performance evaluations show that DeepSparseGPU transfers 1.39x-2.18x less host-to-device (H2D) and device-to-host (D2H) data, and executes up to 2.93x faster than the UM-based baseline version.
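The abstract's core idea, tiling every solver operation so that only a bounded working set must be resident on the GPU at any time, can be illustrated with a small OpenMP target offload kernel. The sketch below is not DeepSparseGPU's code (the paper's source is not reproduced here); the CSR tile layout and all names are assumptions chosen to show the pattern: each tile's arrays are mapped to the device only for the duration of one DAG task, so the device footprint stays bounded regardless of total matrix size.

```cpp
// Illustrative sketch only: tile layout and names are assumptions, not the
// paper's actual code. Shows the general pattern the abstract describes:
// tile a sparse operation and offload one tile at a time via OpenMP target.
#include <vector>

// A hypothetical CSR tile (a block of rows of the sparse matrix).
struct CsrTile {
    std::vector<int>    row_ptr;  // size = rows + 1
    std::vector<int>    col_idx;  // column indices of nonzeros
    std::vector<double> val;      // nonzero values
    int rows;                     // number of rows in this tile
};

// y_tile += A_tile * x : one DAG task in a tiled sparse solver.
void spmv_tile(const CsrTile& t, const double* x, int n, double* y_tile) {
    const int*    rp   = t.row_ptr.data();
    const int*    ci   = t.col_idx.data();
    const double* v    = t.val.data();
    const int     nnz  = (int)t.val.size();
    const int     rows = t.rows;

    // Map only this tile's data to the device; once the target region ends,
    // the tile can be released, keeping the device working set bounded.
    #pragma omp target teams distribute parallel for \
        map(to: rp[0:rows+1], ci[0:nnz], v[0:nnz], x[0:n]) \
        map(tofrom: y_tile[0:rows])
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (int j = rp[r]; j < rp[r+1]; ++j)
            sum += v[j] * x[ci[j]];
        y_tile[r] += sum;
    }
}
```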
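The memory manager (MM) the abstract describes tracks which data resides on the CPU and GPU and moves it on demand. A minimal sketch of that bookkeeping follows, assuming an LRU eviction policy and per-tile device buffers; the policy, class, and method names are illustrative assumptions, not the paper's design. It uses only standard OpenMP device memory routines (omp_target_alloc, omp_target_memcpy, omp_target_free).

```cpp
// Hedged sketch of the bookkeeping role the abstract assigns to the MM:
// track device-resident tiles and evict least-recently-used ones when
// capacity is exceeded. The LRU policy and all names are assumptions.
#include <omp.h>
#include <cstddef>
#include <list>
#include <unordered_map>

class TileMemoryManager {
    size_t capacity_, used_ = 0;
    int dev_  = omp_get_default_device();
    int host_ = omp_get_initial_device();
    struct Entry { void* dev_ptr; size_t bytes; };
    std::unordered_map<int, Entry> resident_;  // tile id -> device buffer
    std::list<int> lru_;                       // front = most recently used

public:
    explicit TileMemoryManager(size_t capacity) : capacity_(capacity) {}

    // Return a device pointer for a tile, copying H2D only on a miss.
    // (host_ptr is non-const to match the OpenMP 4.5 memcpy signature.)
    void* acquire(int tile_id, void* host_ptr, size_t bytes) {
        if (auto it = resident_.find(tile_id); it != resident_.end()) {
            lru_.remove(tile_id); lru_.push_front(tile_id);
            return it->second.dev_ptr;          // hit: no transfer needed
        }
        while (used_ + bytes > capacity_ && !lru_.empty())
            evict(lru_.back());
        void* d = omp_target_alloc(bytes, dev_);
        omp_target_memcpy(d, host_ptr, bytes, 0, 0, dev_, host_);
        resident_[tile_id] = {d, bytes}; used_ += bytes;
        lru_.push_front(tile_id);
        return d;
    }

private:
    void evict(int tile_id) {
        Entry e = resident_[tile_id];
        omp_target_free(e.dev_ptr, dev_);
        used_ -= e.bytes;
        resident_.erase(tile_id);
        lru_.remove(tile_id);
    }
};
```

In such a scheme, a task like spmv_tile above would call acquire() for each operand before launching, and repeated hits avoid H2D traffic entirely; this kind of reuse tracking is consistent with how the framework reduces H2D/D2H transfers relative to the UM baseline, though the paper's actual policy may differ.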