swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight

Jiarui Fang, H. Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, Guangwen Yang
{"title":"swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight","authors":"Jiarui Fang, H. Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, Guangwen Yang","doi":"10.1109/IPDPS.2017.20","DOIUrl":null,"url":null,"abstract":"To explore the potential of training complex deep neural networks (DNNs) on other commercial chips rather than GPUs, we report our work on swDNN, which is a highly-efficient library for accelerating deep learning applications on the newly announced world-leading supercomputer, Sunway TaihuLight. Targeting SW26010 processor, we derive a performance model that guides us in the process of identifying the most suitable approach for mapping the convolutional neural networks (CNNs) onto the 260 cores within the chip. By performing a systematic optimization that explores major factors, such as organization of convolution loops, blocking techniques, register data communication schemes, as well as reordering strategies for the two pipelines of instructions, we manage to achieve a double-precision performance over 1.6 Tflops for the convolution kernel, achieving 54% of the theoretical peak. Compared with Tesla K40m with cuDNNv5, swDNN results in 1.91-9.75x performance speedup in an evaluation with over 100 parameter configurations.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"67","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2017.20","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 67

Abstract

To explore the potential of training complex deep neural networks (DNNs) on commercial chips other than GPUs, we report our work on swDNN, a highly efficient library for accelerating deep learning applications on the newly announced world-leading supercomputer, Sunway TaihuLight. Targeting the SW26010 processor, we derive a performance model that guides us in identifying the most suitable approach for mapping convolutional neural networks (CNNs) onto the 260 cores within the chip. By performing a systematic optimization that explores major factors such as the organization of convolution loops, blocking techniques, register data communication schemes, and reordering strategies for the two instruction pipelines, we achieve a double-precision performance of over 1.6 Tflops for the convolution kernel, 54% of the theoretical peak. Compared with a Tesla K40m running cuDNNv5, swDNN delivers a 1.91-9.75x speedup in an evaluation covering more than 100 parameter configurations.
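For context, the 54% efficiency figure is consistent with measuring against the peak of the chip's four 64-core compute-processing-element (CPE) clusters: assuming the commonly cited SW26010 specifications of 1.45 GHz and 8 double-precision flops per cycle per CPE, 256 x 8 x 1.45 GHz gives roughly 2.97 Tflops, and 1.6 / 2.97 is about 54%. (Which peak the paper measures against is our assumption here.)

The abstract names loop organization and blocking of the convolution as two of the main optimization factors. Below is a minimal, self-contained C sketch of what channel blocking in a direct convolution looks like; the loop order, the block sizes Bco/Bci, and the problem dimensions are all hypothetical choices for readability, not taken from the paper, and the sketch runs on an ordinary CPU rather than modeling the SW26010's 8x8 CPE mesh, its scratchpad memories, or its register communication.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative blocked direct convolution in plain C.
 * Bco/Bci are hypothetical channel-blocking factors; on the SW26010
 * the corresponding filter/input tiles would be staged into each
 * CPE's 64 KB local store (LDM), which this sketch does not model. */

#define Ni 64   /* input channels  */
#define No 64   /* output channels */
#define H  32   /* input height    */
#define W  32   /* input width     */
#define K  3    /* filter size     */
#define Ho (H - K + 1)
#define Wo (W - K + 1)
#define Bco 16  /* output-channel block (hypothetical) */
#define Bci 16  /* input-channel block (hypothetical)  */

static double in[Ni][H][W];
static double wt[No][Ni][K][K];
static double out[No][Ho][Wo];

static void conv_blocked(void)
{
    /* Blocking over (output channel, input channel) keeps a small
     * tile of filters and the matching input rows hot in fast
     * memory while the spatial loops sweep the image. */
    for (int cob = 0; cob < No; cob += Bco)
        for (int cib = 0; cib < Ni; cib += Bci)
            for (int co = cob; co < cob + Bco; co++)
                for (int ci = cib; ci < cib + Bci; ci++)
                    for (int oh = 0; oh < Ho; oh++)
                        for (int ow = 0; ow < Wo; ow++) {
                            double acc = out[co][oh][ow];
                            for (int kh = 0; kh < K; kh++)
                                for (int kw = 0; kw < K; kw++)
                                    acc += wt[co][ci][kh][kw] *
                                           in[ci][oh + kh][ow + kw];
                            out[co][oh][ow] = acc;
                        }
}

int main(void)
{
    /* Deterministic fill so the checksum is reproducible. */
    for (int c = 0; c < Ni; c++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                in[c][y][x] = (c + y + x) % 7;
    for (int o = 0; o < No; o++)
        for (int c = 0; c < Ni; c++)
            for (int y = 0; y < K; y++)
                for (int x = 0; x < K; x++)
                    wt[o][c][y][x] = (o + c + y + x) % 5 - 2;
    memset(out, 0, sizeof out);

    conv_blocked();

    double sum = 0.0;
    for (int o = 0; o < No; o++)
        for (int y = 0; y < Ho; y++)
            for (int x = 0; x < Wo; x++)
                sum += out[o][y][x];
    printf("checksum: %f\n", sum);
    return 0;
}
```

The design point the blocking illustrates is arithmetic intensity: each filter tile loaded into fast memory is reused across the whole spatial sweep, which is a prerequisite for approaching peak flops on a scratchpad-based architecture like the SW26010.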