Investigating half precision arithmetic to accelerate dense linear system solvers

A. Haidar, Panruo Wu, S. Tomov, J. Dongarra
{"title":"研究半精度算法加速密集线性系统求解","authors":"A. Haidar, Panruo Wu, S. Tomov, J. Dongarra","doi":"10.1145/3148226.3148237","DOIUrl":null,"url":null,"abstract":"The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., like the NVIDIA V100, that can provide 120 TeraFLOPS alone in FP16. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique - we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for a first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance, and limitations of the approach.","PeriodicalId":440657,"journal":{"name":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"53","resultStr":"{\"title\":\"Investigating half precision arithmetic to accelerate dense linear system solvers\",\"authors\":\"A. Haidar, Panruo Wu, S. Tomov, J. Dongarra\",\"doi\":\"10.1145/3148226.3148237\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., like the NVIDIA V100, that can provide 120 TeraFLOPS alone in FP16. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique - we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. 
Subsequently, we show for a first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance, and limitations of the approach.\",\"PeriodicalId\":440657,\"journal\":{\"name\":\"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"53\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3148226.3148237\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3148226.3148237","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 53

Abstract

The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today's powerful manycore GPU accelerators, e.g., the NVIDIA V100, which alone can provide 120 TeraFLOPS in FP16. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique: we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for the first time how the use of FP16 arithmetic can significantly accelerate FP32- or FP64-precision Ax = b solvers, as well as make them more energy efficient. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach.
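
To make the idea concrete, below is a minimal sketch of mixed-precision iterative refinement for Ax = b. It is not the authors' MAGMA implementation: the function name and structure are illustrative, and since NumPy offers no half-precision LU solver, float32 stands in for the low-precision factorization while the residual and update are accumulated in float64, mirroring the FP16-factorization / higher-precision-refinement scheme described in the abstract.

```python
import numpy as np

def mixed_precision_refine(A, b, max_iter=50, tol=1e-12):
    """Illustrative mixed-precision iterative refinement (assumption:
    float32 emulates the low-precision factorization, float64 is the
    working precision)."""
    # Low-precision copy of the system; in the paper this would be an
    # FP16 LU factorization computed on the GPU.
    A_lo = A.astype(np.float32)

    # Initial solution from the low-precision solve, promoted to FP64.
    x = np.linalg.solve(A_lo, b.astype(np.float32)).astype(np.float64)

    for _ in range(max_iter):
        # Residual computed in the working (high) precision.
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction obtained by reusing the cheap low-precision solve.
        d = np.linalg.solve(A_lo, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_refine(A, b)
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

The key point the sketch illustrates is that the expensive O(n^3) factorization is done once in low precision, while each refinement step costs only an O(n^2) residual and triangular-solve-style correction in higher precision, which is why FP16 hardware throughput translates into faster FP32/FP64-accurate solvers when the matrix is reasonably conditioned.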