Leveraging GPU Tensor Cores for Double Precision Euclidean Distance Calculations

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC). Pub Date: 2022-09-22. DOI: 10.1109/HiPC56025.2022.00029

Benoît Gallet, M. Gowanlock


Citations: 1

Abstract

Tensor cores (TCs) are a type of Application-Specific Integrated Circuit (ASIC) and are a recent addition to Graphics Processing Unit (GPU) architectures. As such, TCs are purposefully designed to greatly improve the performance of Matrix Multiply-Accumulate (MMA) operations. While TCs are heavily studied for machine learning and closely related fields, where their high efficiency is undeniable, MMA operations are not unique to these fields. More generally, any computation that can be expressed as MMA operations can leverage TCs, and potentially benefit from their higher computational throughput compared to other general-purpose cores, such as CUDA cores on Nvidia GPUs. In this paper, we propose the first double precision (FP64) Euclidean distance calculation algorithm, which is expressed as MMA operations to leverage TCs on Nvidia GPUs, rather than the more commonly used CUDA cores. To show that the Euclidean distance can be accelerated in a real-world application, we evaluate our proposed TC algorithm on the distance similarity self-join problem, as the most computationally intensive part of the algorithm consists of computing distances in a multi-dimensional space. We find that the performance gain from using the tensor core algorithm over the CUDA core algorithm depends weakly on the dataset size and distribution, but is strongly dependent on data dimensionality. Overall, TCs are a compelling alternative to CUDA cores, particularly when the data dimensionality is low (≤ 4), as we achieve an average speedup of 1.28× and up to 2.23× against a state-of-the-art GPU distance similarity self-join algorithm. Furthermore, because this paper is among the first to explore the use of TCs for FP64 general-purpose computation, future research is promising.
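The abstract does not spell out how the distance computation is mapped onto MMA operations. As a rough illustration only, the sketch below uses the textbook expansion ‖p − q‖² = ‖p‖² + ‖q‖² − 2(p · q), which lets all-pairs squared distances be written as a single multiply-accumulate D = A·B + C; the paper's actual formulation may differ. The function names (`mma`, `squared_distances`, `self_join`) and the sample points are hypothetical, and plain Python floats (IEEE 754 double precision) stand in for the FP64 tensor core path.

```python
def mma(a, b, c):
    """Return a @ b + c for small dense matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


def squared_distances(points):
    """All-pairs squared Euclidean distances via one multiply-accumulate.

    Assumed formulation (not taken from the paper):
    ||p - q||^2 = ||p||^2 + ||q||^2 - 2 * (p . q).
    The accumulator C holds the norm terms; A @ B supplies the cross terms.
    """
    n = len(points)
    norms = [sum(x * x for x in p) for p in points]
    a = [[-2.0 * x for x in p] for p in points]      # n x d, scaled by -2
    b = [list(col) for col in zip(*points)]          # d x n, the transpose
    c = [[norms[i] + norms[j] for j in range(n)] for i in range(n)]
    return mma(a, b, c)


def self_join(points, epsilon):
    """Return index pairs (i, j), i < j, whose distance is at most epsilon."""
    d2 = squared_distances(points)
    eps2 = epsilon * epsilon                         # compare squared values
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if d2[i][j] <= eps2]


# Hypothetical 2-D sample data; only points 0 and 2 lie within distance 1.0.
points = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.0), (10.0, 10.0)]
pairs = self_join(points, 1.0)
```

This brute-force self-join compares every pair of points and is meant only to show the shape of the computation. The expanded form can also lose accuracy through cancellation when two points are very close together, so a real implementation would need to weigh that against the throughput gain.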

