Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint

IF 2.5 | Zone 3, Computer Science | JCR Q2: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

International Journal of High Performance Computing Applications Pub Date : 2023-01-03 DOI:10.1177/10943420221136848

Florent Lopez, Théo Mary

Volume 37, issue 1, pages 165-179

Citations: 9

Abstract

Modern GPUs equipped with mixed precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.
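
The storage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it is unblocked, uses no pivoting, runs on the CPU, and only emulates fp16 and fp32 rounding with Python's `struct` module. All function names, the matrix size (48), and the diagonally dominant test matrix are assumptions made for illustration. The sketch compares a right-looking LU that rounds the trailing matrix to half precision after every update with a left-looking LU that updates each column in a single precision buffer and rounds it to half precision once.

```python
import random
import struct


def to_half(x):
    """Round a float to IEEE binary16 and back (emulated fp16 storage)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round a float to IEEE binary32 and back (emulated fp32 arithmetic)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def make_matrix(n, seed=0):
    """Hypothetical test matrix: random entries, dominant diagonal, stored in fp16."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        a[i][i] = float(n)  # diagonal dominance, so no pivoting is needed
    return [[to_half(v) for v in row] for row in a]


def lu_right_looking_half(a_half):
    """Unpivoted LU; every trailing update is rounded back to fp16 storage."""
    n = len(a_half)
    lu = [row[:] for row in a_half]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = to_half(to_single(lu[i][k] / lu[k][k]))
            for j in range(k + 1, n):
                prod = to_single(lu[i][k] * lu[k][j])
                lu[i][j] = to_half(to_single(lu[i][j] - prod))
    return lu


def lu_left_looking_buffered(a_half):
    """Unpivoted LU; each column is updated in an fp32 buffer, stored once in fp16."""
    n = len(a_half)
    lu = [row[:] for row in a_half]
    for j in range(n):
        buf = [lu[i][j] for i in range(n)]  # fp16 values are exact in fp32
        for k in range(j):  # apply all previously factored columns
            for i in range(k + 1, n):
                buf[i] = to_single(buf[i] - to_single(lu[i][k] * buf[k]))
        pivot = to_half(buf[j])  # divide by the pivot as it will be stored
        for i in range(j + 1, n):
            buf[i] = to_single(buf[i] / pivot)
        for i in range(n):
            lu[i][j] = to_half(buf[i])  # single rounding to fp16 per entry
    return lu


def factorization_error(a_half, lu):
    """Max |A - L*U| divided by max |A|, evaluated in double precision."""
    n = len(a_half)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]  # unit diagonal of L
                s += l_ik * lu[k][j]
            worst = max(worst, abs(a_half[i][j] - s))
    scale = max(abs(v) for row in a_half for v in row)
    return worst / scale


a_half = make_matrix(48)
err_right = factorization_error(a_half, lu_right_looking_half(a_half))
err_left = factorization_error(a_half, lu_left_looking_buffered(a_half))
print(f"right-looking, fp16 after every update: {err_right:.2e}")
print(f"left-looking, fp32 column buffer:       {err_left:.2e}")
```

On a matrix of this kind, the buffered left-looking variant should show the smaller error, because each stored entry is rounded to half precision once rather than after every update. The sketch says nothing about performance, blocking, or tensor core use, which are the subject of the article itself.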

Source journal

International Journal of High Performance Computing Applications (Engineering & Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 6.10

Self-citation rate: 6.50%

Review time: >12 weeks
