3.2 The A100 Datacenter GPU and Ampere Architecture
Jack Choquette, Ming-Ju Edward Lee, R. Krashinsky, V. Balan, Brucek Khailany
2021 IEEE International Solid-State Circuits Conference (ISSCC), February 2021. doi: 10.1109/ISSCC42613.2021.9365803
The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The A100 GPU introduces several features targeting these workloads: a 3rd-generation Tensor Core with support for fine-grained sparsity; new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes; scale-out support with multi-instance GPU (MIG) virtualization; and scale-up support with a 3rd-generation 50Gbps NVLink I/O interface (NVLink3) and NVSwitch inter-GPU communication. As shown in Fig. 3.2.1, A100 contains 108 Streaming Multiprocessors (SMs) and 6912 CUDA cores. The SMs are fed by a 40MB L2 cache and 1.56TB/s of HBM2 memory bandwidth (BW). At 1.41GHz, A100 provides an effective peak of 1248 TOPS (8b integers), 624 TFLOPS (FP16), and 312 TFLOPS (TF32) when including sparsity optimizations. Implemented in a TSMC 7nm N7 process, the A100 die (Fig. 3.2.7) contains 54B transistors and measures 826mm².
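
The abstract does not spell out the sparsity pattern; NVIDIA's public Ampere documentation specifies 2:4 fine-grained structured sparsity, meaning two nonzero values in each group of four consecutive weights. The sketch below illustrates that pattern with a plain magnitude-pruning rule, which is an assumed stand-in for illustration, not NVIDIA's actual pruning tooling.

    import numpy as np

    # Prune a weight vector to the 2:4 structured-sparse pattern that the
    # 3rd-generation Tensor Cores accelerate: keep the 2 largest-magnitude
    # values in every group of 4, zero the rest.
    def prune_2_of_4(w: np.ndarray) -> np.ndarray:
        """Zero the two smallest-magnitude entries in each group of 4 weights."""
        groups = w.reshape(-1, 4).copy()
        smallest = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
        np.put_along_axis(groups, smallest, 0.0, axis=1)
        return groups.reshape(w.shape)

    w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, -0.03, 0.6])
    print(prune_2_of_4(w))   # [ 0.9  0.   0.4  0.  -0.7  0.   0.   0.6]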
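The new datatypes keep FP32's 8-bit exponent range while shortening the mantissa. The layouts below are the published formats; the conversion helper is a minimal sketch of how BF16 relates to FP32 (simple truncation, with round-to-nearest omitted), not production conversion code.

    # Bit layouts (sign/exponent/mantissa) of the datatypes named above:
    #   FP32  1/8/23
    #   TF32  1/8/10   Tensor Core math format: FP32 range, FP16-class precision
    #   BF16  1/8/7    FP32 range in 16 bits
    #   FP16  1/5/10
    # Because BF16 shares FP32's sign and exponent fields, truncating an FP32
    # encoding to its top 16 bits yields a valid BF16 value:
    import struct

    def bf16_truncate(x: float) -> float:
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

    print(bf16_truncate(3.14159265))   # 3.140625 -- only ~2-3 decimal digits survive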
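The headline peak rates follow directly from the SM count and clock, assuming the per-SM Tensor Core rate from NVIDIA's A100 whitepaper (4 Tensor Cores per SM, each sustaining 256 dense FP16 FMAs per clock), a figure not stated in the abstract itself:

    # Sanity check of the abstract's peak-rate claims from SM count and clock.
    SMS, CLOCK_GHZ = 108, 1.41
    FMA_PER_SM_CLK = 4 * 256                                  # dense FP16 FMAs / SM / clock
    dense_fp16 = SMS * FMA_PER_SM_CLK * 2 * CLOCK_GHZ / 1e3   # TFLOPS (2 FLOPs per FMA)

    print(f"FP16 sparse: {2 * dense_fp16:.1f} TFLOPS")  # 623.7, the quoted 624 (2:4 sparsity doubles dense)
    print(f"TF32 sparse: {dense_fp16:.1f} TFLOPS")      # 311.9, the quoted 312 (TF32 runs at half the FP16 rate)
    print(f"INT8 sparse: {4 * dense_fp16:.1f} TOPS")    # 1247.5, the quoted 1248 (INT8 runs at 2x the FP16 rate)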
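One consequence of these figures: at 312 dense FP16 TFLOPS against 1.56TB/s of HBM2 bandwidth, a kernel must perform roughly 200 FLOPs per byte of DRAM traffic to be compute-bound rather than memory-bound, which is where the 40MB L2 cache and on-chip data reuse earn their area. A one-line check of that standard roofline arithmetic:

    # Machine-balance point implied by the abstract's figures.
    PEAK_DENSE_FP16_TFLOPS = 312    # sparse 624 TFLOPS / 2
    HBM2_TB_PER_S = 1.56
    print(PEAK_DENSE_FP16_TFLOPS / HBM2_TB_PER_S)   # 200.0 FLOPs/byte break-even intensity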