{"title":"Optimization of GPU and CPU acceleration for neural networks layers implemented in python","authors":"R. Dogaru, I. Dogaru","doi":"10.1109/ISEEE.2017.8170680","DOIUrl":null,"url":null,"abstract":"Many neural architectures including RBF, SVM, FSVC classifiers, or deep-learning solutions require the efficient implementation of neurons layers, each of them having a given number of m neurons, a specific set of parameters and operating on a training or test set of N feature vectors having each a dimension n. Herein we investigate how to allocate the computation on GPU kernels and how to better optimize the problem parameters (neural structure and training set size) as well as the GPU parameters in order to maximize the acceleration (relative to a CPU implementation). It is shown that by maximizing the load (number of threads on each computational GPU core) and by a proper allocation of the GPU global memory, very large speedups (100–250 times) with respect to the CPU implementation can be achieved while using the convenient NUMBA Python package supporting CUDA programming of GPU. Consequently, it is shown that given a problem to be posed to a neural network a convenient decomposition of the network can be done in order to allocate optimally the parts of the computation to the GPU in order to maximize efficiency. Also, for CPU implementations it was found that Intel's MKL library (called from NUMPY package) can offer efficient implementation of neural layers, comparable to what is achieved using GPU.","PeriodicalId":276733,"journal":{"name":"2017 5th International Symposium on Electrical and Electronics Engineering (ISEEE)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 5th International Symposium on Electrical and Electronics Engineering (ISEEE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISEEE.2017.8170680","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10
Abstract
Many neural architectures, including RBF, SVM, and FSVC classifiers as well as deep-learning solutions, require the efficient implementation of neuron layers, each with a given number m of neurons and a specific set of parameters, operating on a training or test set of N feature vectors, each of dimension n. Herein we investigate how to allocate the computation to GPU kernels and how to best optimize the problem parameters (neural structure and training-set size) as well as the GPU parameters in order to maximize the acceleration relative to a CPU implementation. It is shown that by maximizing the load (the number of threads on each computational GPU core) and by a proper allocation of the GPU global memory, very large speedups (100–250 times) with respect to the CPU implementation can be achieved using the convenient NUMBA Python package, which supports CUDA programming of GPUs. Consequently, it is shown that, given a problem posed to a neural network, the network can be conveniently decomposed so that the parts of the computation are allocated optimally to the GPU, maximizing efficiency. Also, for CPU implementations it was found that Intel's MKL library (called from the NumPy package) can offer an efficient implementation of neural layers, comparable to what is achieved using the GPU.
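To make the setup concrete, below is a minimal sketch (not the authors' code) of the layer computation the abstract describes: N feature vectors of dimension n pushed through a layer of m neurons, once with a Numba CUDA kernel operating on arrays placed in GPU global memory, and once with NumPy, whose matrix product is dispatched to Intel MKL on MKL builds. The array names, the tanh activation, and the 16×16 thread-block shape are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a neural layer Y = f(X @ W.T + b) on GPU (Numba CUDA) and CPU (NumPy/MKL).
import math
import numpy as np
from numba import cuda

@cuda.jit
def layer_kernel(X, W, b, Y):
    # One thread per output element Y[i, j]: neuron j applied to sample i.
    i, j = cuda.grid(2)
    if i < Y.shape[0] and j < Y.shape[1]:
        acc = b[j]
        for k in range(X.shape[1]):          # dot product over the n inputs
            acc += X[i, k] * W[j, k]
        Y[i, j] = math.tanh(acc)             # illustrative activation choice

def layer_gpu(X, W, b, threads=(16, 16)):
    # Copy operands to GPU global memory once, launch the kernel, copy the result back.
    N, m = X.shape[0], W.shape[0]
    d_X, d_W, d_b = cuda.to_device(X), cuda.to_device(W), cuda.to_device(b)
    d_Y = cuda.device_array((N, m), dtype=X.dtype)
    blocks = (math.ceil(N / threads[0]), math.ceil(m / threads[1]))
    layer_kernel[blocks, threads](d_X, d_W, d_b, d_Y)
    return d_Y.copy_to_host()

def layer_cpu(X, W, b):
    # CPU baseline: np.dot / @ is routed to the BLAS NumPy was built against
    # (Intel MKL on MKL builds), which is the CPU reference the paper compares to.
    return np.tanh(X @ W.T + b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, n, m = 4096, 64, 256                  # samples, input dimension, neurons
    X = rng.standard_normal((N, n), dtype=np.float32)
    W = rng.standard_normal((m, n), dtype=np.float32)
    b = rng.standard_normal(m, dtype=np.float32)
    print(np.allclose(layer_gpu(X, W, b), layer_cpu(X, W, b), atol=1e-3))
```

In this sketch, the grid is sized so that every output element gets its own thread; increasing N and m therefore increases the number of threads per GPU core, which is the kind of load maximization the abstract identifies as the main lever for reaching the reported 100–250× speedups.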