Multithreaded Layer-wise Training of Sparse Deep Neural Networks using Compressed Sparse Column

M. Hasanzadeh-Mofrad, R. Melhem, Muhammad Yousuf Ahmad, Mohammad Hammoud
{"title":"Multithreaded Layer-wise Training of Sparse Deep Neural Networks using Compressed Sparse Column","authors":"M. Hasanzadeh-Mofrad, R. Melhem, Muhammad Yousuf Ahmad, Mohammad Hammoud","doi":"10.1109/HPEC.2019.8916494","DOIUrl":null,"url":null,"abstract":"Training a sparse Deep Neural Network (DNN) is inherently less memory-intensive and processor-intensive compared to training a dense (fully-connected) DNN. In this paper, we utilize Sparse Matrix-Matrix Multiplication (SpMM) to train sparsely-connected DNNs as opposed to dense matrix-matrix multiplication used for training dense DNNs. In our C/C++ implementation, we extensively use in-memory Compressed Sparse Column (CSC) data structures to store and traverse the neural network layers. Also, we train the neural network layer by layer, and within each layer we use 1D-Column partitioning to divide the computation required for training among threads. To speedup the computation, we apply the bias and activation functions while executing SpMM operations. We tested our implementation using benchmarks provided by MIT/IEEE/Amazon HPEC graph challenge [1]. Based on our results, our single thread (1 core) and multithreaded (12 cores) implementations are up to $22 \\times$, and $150 \\times$ faster than the serial Matlab results provided by the challenge. We believe this speedup is due to the 1D-Column partitioning that we use to balance the computation of SpMM operations among computing threads, the efficient mechanism that we use for memory (re)allocation of sparse matrices, and the overlapping of the accumulation of SpMM results with the application of the bias and activation functions.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC.2019.8916494","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Training a sparse Deep Neural Network (DNN) is inherently less memory-intensive and processor-intensive compared to training a dense (fully-connected) DNN. In this paper, we utilize Sparse Matrix-Matrix Multiplication (SpMM) to train sparsely-connected DNNs as opposed to dense matrix-matrix multiplication used for training dense DNNs. In our C/C++ implementation, we extensively use in-memory Compressed Sparse Column (CSC) data structures to store and traverse the neural network layers. Also, we train the neural network layer by layer, and within each layer we use 1D-Column partitioning to divide the computation required for training among threads. To speedup the computation, we apply the bias and activation functions while executing SpMM operations. We tested our implementation using benchmarks provided by MIT/IEEE/Amazon HPEC graph challenge [1]. Based on our results, our single thread (1 core) and multithreaded (12 cores) implementations are up to $22 \times$, and $150 \times$ faster than the serial Matlab results provided by the challenge. We believe this speedup is due to the 1D-Column partitioning that we use to balance the computation of SpMM operations among computing threads, the efficient mechanism that we use for memory (re)allocation of sparse matrices, and the overlapping of the accumulation of SpMM results with the application of the bias and activation functions.
基于压缩稀疏列的稀疏深度神经网络多线程分层训练
与训练密集(全连接)深度神经网络相比,训练稀疏深度神经网络(DNN)本质上是更少的内存密集型和处理器密集型。在本文中,我们使用稀疏矩阵-矩阵乘法(SpMM)来训练稀疏连接的dnn,而不是用于训练密集dnn的密集矩阵-矩阵乘法。在我们的C/ c++实现中,我们广泛使用内存中的压缩稀疏列(CSC)数据结构来存储和遍历神经网络层。此外,我们一层一层地训练神经网络,在每一层中,我们使用1D-Column分区来划分线程之间训练所需的计算量。为了加快计算速度,我们在执行SpMM操作时应用偏置和激活函数。我们使用MIT/IEEE/Amazon HPEC图形挑战[1]提供的基准测试来测试我们的实现。根据我们的结果,我们的单线程(1核)和多线程(12核)实现比该挑战提供的串行Matlab结果快22倍,快150倍。我们认为这种加速是由于我们用于平衡计算线程之间SpMM操作的计算的1D-Column分区,我们用于稀疏矩阵的内存(重新)分配的有效机制,以及SpMM结果累积与应用偏差和激活函数的重叠。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信