A Portable Sparse Solver Framework for Large Matrices on Heterogeneous Architectures

F. Rabbi, C. Daley, Ümit V. Çatalyürek, H. Aktulga

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2022. DOI: 10.1109/HiPC56025.2022.00030
Abstract: Programming applications on heterogeneous systems with hardware accelerators is challenging due to the disjoint address spaces of the host (CPU) and the device (GPU). Limited device memory exacerbates the challenge, as the working sets of most data-intensive applications do not fit within it. CUDA Unified Memory (UM) was introduced to mitigate these challenges: it improves GPU programmability by supporting memory oversubscription, on-demand paging, and migration. However, when an application's working set exceeds device memory capacity, the resulting data movement can cause significant performance losses. We propose a tiling-based task-parallel framework, named DeepSparseGPU, that accelerates sparse eigensolvers on GPUs by minimizing data movement between the host and the device. To this end, we tile all operations in a sparse solver and express the entire computation as a directed acyclic graph (DAG). We also design and develop a memory manager (MM) to execute large inputs that do not fit into GPU memory: MM keeps track of the data on the CPU and GPU, and automatically moves data between them as needed. Our implementation uses OpenMP target offload to achieve portability beyond NVIDIA hardware. Performance evaluations show that DeepSparseGPU transfers 1.39x-2.18x less host-to-device (H2D) and device-to-host (D2H) data, and executes up to 2.93x faster than the UM-based baseline version.
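The abstract's core idea, tiling every solver operation so that only a bounded working set must be resident on the GPU at any time, can be illustrated with a small OpenMP target offload kernel. The sketch below is not DeepSparseGPU's code (the paper's source is not reproduced here); the CSR tile layout and all names are assumptions chosen to show the pattern: each tile's arrays are mapped to the device only for the duration of one DAG task, so the device footprint stays bounded regardless of total matrix size.

```cpp
// Illustrative sketch only: tile layout and names are assumptions, not the
// paper's actual code. Shows the general pattern the abstract describes:
// tile a sparse operation and offload one tile at a time via OpenMP target.
#include <vector>

// A hypothetical CSR tile (a block of rows of the sparse matrix).
struct CsrTile {
    std::vector<int>    row_ptr;  // size = rows + 1
    std::vector<int>    col_idx;  // column indices of nonzeros
    std::vector<double> val;      // nonzero values
    int rows;                     // number of rows in this tile
};

// y_tile += A_tile * x : one DAG task in a tiled sparse solver.
void spmv_tile(const CsrTile& t, const double* x, int n, double* y_tile) {
    const int*    rp   = t.row_ptr.data();
    const int*    ci   = t.col_idx.data();
    const double* v    = t.val.data();
    const int     nnz  = (int)t.val.size();
    const int     rows = t.rows;

    // Map only this tile's data to the device; once the target region ends,
    // the tile can be released, keeping the device working set bounded.
    #pragma omp target teams distribute parallel for \
        map(to: rp[0:rows+1], ci[0:nnz], v[0:nnz], x[0:n]) \
        map(tofrom: y_tile[0:rows])
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        for (int j = rp[r]; j < rp[r+1]; ++j)
            sum += v[j] * x[ci[j]];
        y_tile[r] += sum;
    }
}
```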
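The memory manager (MM) the abstract describes tracks which data resides on the CPU and GPU and moves it on demand. A minimal sketch of that bookkeeping follows, assuming an LRU eviction policy and per-tile device buffers; the policy, class, and method names are illustrative assumptions, not the paper's design. It uses only standard OpenMP device memory routines (omp_target_alloc, omp_target_memcpy, omp_target_free).

```cpp
// Hedged sketch of the bookkeeping role the abstract assigns to the MM:
// track device-resident tiles and evict least-recently-used ones when
// capacity is exceeded. The LRU policy and all names are assumptions.
#include <omp.h>
#include <cstddef>
#include <list>
#include <unordered_map>

class TileMemoryManager {
    size_t capacity_, used_ = 0;
    int dev_  = omp_get_default_device();
    int host_ = omp_get_initial_device();
    struct Entry { void* dev_ptr; size_t bytes; };
    std::unordered_map<int, Entry> resident_;  // tile id -> device buffer
    std::list<int> lru_;                       // front = most recently used

public:
    explicit TileMemoryManager(size_t capacity) : capacity_(capacity) {}

    // Return a device pointer for a tile, copying H2D only on a miss.
    // (host_ptr is non-const to match the OpenMP 4.5 memcpy signature.)
    void* acquire(int tile_id, void* host_ptr, size_t bytes) {
        if (auto it = resident_.find(tile_id); it != resident_.end()) {
            lru_.remove(tile_id); lru_.push_front(tile_id);
            return it->second.dev_ptr;          // hit: no transfer needed
        }
        while (used_ + bytes > capacity_ && !lru_.empty())
            evict(lru_.back());
        void* d = omp_target_alloc(bytes, dev_);
        omp_target_memcpy(d, host_ptr, bytes, 0, 0, dev_, host_);
        resident_[tile_id] = {d, bytes}; used_ += bytes;
        lru_.push_front(tile_id);
        return d;
    }

private:
    void evict(int tile_id) {
        Entry e = resident_[tile_id];
        omp_target_free(e.dev_ptr, dev_);
        used_ -= e.bytes;
        resident_.erase(tile_id);
        lru_.remove(tile_id);
    }
};
```

In such a scheme, a task like spmv_tile above would call acquire() for each operand before launching, and repeated hits avoid H2D traffic entirely; this kind of reuse tracking is consistent with how the framework reduces H2D/D2H transfers relative to the UM baseline, though the paper's actual policy may differ.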