Throughput-oriented and Accuracy-aware DNN Training with BFloat16 on GPU

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date : 2022-05-01 DOI:10.1109/IPDPSW55747.2022.00176

Zhen Xie, Siddhisanket Raskar, M. Emani


Citations: 1

Abstract

Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and achieved extraordinary success in many areas. Training DNNs is typically compute- and memory-intensive, which has motivated several optimizations of the training phase. Among them, reduced precision is a widely used technique to accelerate DNN training and reduce memory requirements. However, applying a widely adopted reduced-precision format such as Float16 to all operations in DNN training is not optimal, because using Float16 in some operations can hurt model accuracy. Additional techniques such as loss scaling and autocast can mitigate the accuracy loss, but they introduce inherent overhead and leave reduced precision underused. In this work, we leverage another reduced-precision format, BFloat16, and introduce a throughput-oriented and accuracy-aware approach to maximize the performance potential of DNN training. Since the high throughput provided by the BFloat16 format is accompanied by low precision of the floating-point representation, this approach achieves high throughput by using BFloat16 on all DNN operations and avoids the accuracy loss through a customized accuracy-aware normalization. Results show that our approach outperforms state-of-the-art mixed-precision training by 1.21x on an NVIDIA A100 GPU.
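The range-versus-precision trade-off between Float16 and BFloat16 mentioned in the abstract can be illustrated with a small sketch. It is not the paper's method and does not implement its accuracy-aware normalization; it only emulates the two number formats in plain Python. The helper names `to_bfloat16` and `to_float16` are hypothetical, and BFloat16 is emulated by rounding a float32 bit pattern to its upper 16 bits (1 sign, 8 exponent, 7 mantissa bits), which is an assumption about how one might model the format in software.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate BFloat16 (hypothetical helper): round float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that are discarded.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_float16(x: float) -> float:
    """Round to IEEE half precision (hypothetical helper): 5 exponent, 10 mantissa bits."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the Float16 range; model that as infinity.
        return float("inf") if x > 0 else float("-inf")


if __name__ == "__main__":
    for value in (1.001, 70000.0, 1e-8):
        print(value, "float16:", to_float16(value), "bfloat16:", to_bfloat16(value))
```

Under these assumptions, a value near 1 keeps more digits in Float16 than in BFloat16, while a large value overflows and a very small one (such as a tiny gradient) flushes to zero in Float16 but survives in BFloat16. That underflow is the usual motivation for loss scaling with Float16, and the coarser mantissa is the precision cost that BFloat16 pays for its wider range.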

