Optimization of GPU and CPU acceleration for neural network layers implemented in Python

R. Dogaru, I. Dogaru
DOI: 10.1109/ISEEE.2017.8170680
Published in: 2017 5th International Symposium on Electrical and Electronics Engineering (ISEEE), October 2017
Citations: 10

Abstract

Many neural architectures, including RBF, SVM, and FSVC classifiers as well as deep-learning solutions, require the efficient implementation of neuron layers, each with a given number of m neurons and a specific set of parameters, operating on a training or test set of N feature vectors, each of dimension n. Herein we investigate how to allocate the computation across GPU kernels and how to optimize both the problem parameters (neural structure and training set size) and the GPU parameters in order to maximize the acceleration relative to a CPU implementation. It is shown that by maximizing the load (the number of threads on each computational GPU core) and by properly allocating the GPU global memory, very large speedups (100–250 times) with respect to the CPU implementation can be achieved using the convenient NUMBA Python package, which supports CUDA programming of GPUs. Consequently, it is shown that, given a problem posed to a neural network, a convenient decomposition of the network can be performed so as to allocate the parts of the computation to the GPU optimally and thereby maximize efficiency. Also, for CPU implementations, it was found that Intel's MKL library (called from the NUMPY package) can offer an efficient implementation of neural layers, comparable to what is achieved using the GPU.
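The CPU result described above hinges on the fact that a neural layer applied to a whole batch is a single matrix product, which NumPy delegates to the underlying BLAS (Intel MKL when NumPy is built against it). A minimal sketch of this batched-layer formulation follows; the function name and the tanh activation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dense_layer(X, W, b):
    """Evaluate one layer of m neurons on N feature vectors at once.

    X : (N, n) batch of feature vectors, each of dimension n
    W : (n, m) weight matrix, one column per neuron
    b : (m,)   bias vector

    The (N, n) x (n, m) product is a single GEMM call, so NumPy
    dispatches it to its BLAS backend (e.g. Intel MKL), which is
    where the CPU-side efficiency reported in the paper comes from.
    """
    return np.tanh(X @ W + b)

# Tiny example: N=4 samples, n=3 features, m=2 neurons.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = np.zeros(2)
Y = dense_layer(X, W, b)
print(Y.shape)  # (4, 2)
```

On the GPU side, the same per-sample, per-neuron computation is what the paper distributes over NUMBA CUDA kernel threads, tuning the thread count per core and the global-memory layout to reach the reported 100–250x speedups.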