Kunlun: A 14nm High-Performance AI Processor for Diversified Workloads
Authors: Jian Ouyang, Xueliang Du, Yin Ma, Jiaqiang Liu
Published in: 2021 IEEE International Solid-State Circuits Conference (ISSCC), 2021-02-13
DOI: 10.1109/ISSCC42613.2021.9366056 (https://doi.org/10.1109/ISSCC42613.2021.9366056)
Citation count: 8
Abstract
To handle a wide range of AI applications, such as speech, image, language and autonomous driving, an AI accelerator must be flexible enough to support diversified workloads. Baidu Kunlun, an AI chip designed in-house by Baidu, achieves this with high programmability, flexibility and performance. Its design was inspired by the XPU architecture [1]. The chip is implemented in Samsung 14nm process technology. Its peak performance is 230TOPS@INT8 at the 900MHz base frequency and up to 281TOPS@INT8 at the 1.1GHz boost frequency. The memory bandwidth is 512GB/s and the peak power is 160W. Baidu Kunlun performs well across various types of workloads. At the 900MHz base frequency, its latencies for BERT, ResNet50 and YOLOv3 are $1.7\times$, $1.2\times$ and $2\times$ lower, respectively, than those of an Nvidia T4 GPU with TensorRT optimizations. Baidu Kunlun has recently been deployed in Baidu's data centers to serve many applications. It achieves $1.5\times$ to $3\times$ better performance than the Nvidia T4 on several models within the search engine.
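The peak figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes 1 TOPS means 10^12 INT8 operations per second and 1 GB/s means 10^9 bytes per second, and the helper names `ops_per_cycle` and `ridge_point` are hypothetical. It derives the number of operations per clock cycle implied by each peak figure and the roofline-style ratio of peak compute to memory bandwidth.

```python
# Figures quoted in the abstract (units are assumptions: 1 TOPS = 1e12 ops/s,
# 1 GB/s = 1e9 bytes/s).
BASE_TOPS, BASE_HZ = 230.0, 900e6      # peak INT8 throughput at base clock
BOOST_TOPS, BOOST_HZ = 281.0, 1.1e9    # peak INT8 throughput at boost clock
MEM_BW_BYTES_PER_S = 512e9             # memory bandwidth


def ops_per_cycle(tops: float, hz: float) -> float:
    """INT8 operations per clock cycle implied by a peak-TOPS figure."""
    return tops * 1e12 / hz


def ridge_point(tops: float, bytes_per_s: float) -> float:
    """Operations per byte of memory traffic at which peak compute and
    peak bandwidth balance (the ridge point of a roofline model)."""
    return tops * 1e12 / bytes_per_s


base_opc = ops_per_cycle(BASE_TOPS, BASE_HZ)
boost_opc = ops_per_cycle(BOOST_TOPS, BOOST_HZ)
base_ridge = ridge_point(BASE_TOPS, MEM_BW_BYTES_PER_S)

print(f"ops/cycle at base clock:  {base_opc:,.0f}")
print(f"ops/cycle at boost clock: {boost_opc:,.0f}")
print(f"ridge point at base clock: {base_ridge:.0f} ops/byte")
```

Under these assumptions, both clock settings imply roughly 2.56e5 operations per cycle, so the two peak numbers are consistent with pure frequency scaling. The base-clock ridge point is about 449 operations per byte, which suggests that workloads with lower arithmetic intensity would be limited by the 512GB/s memory bandwidth and not by peak compute.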