Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

2014 IEEE International Parallel & Distributed Processing Symposium Workshops
Pub Date: 2014-05-19
DOI: 10.1109/IPDPSW.2014.194

Lei Jin, Zhaokang Wang, Rong Gu, C. Yuan, Y. Huang


Citations: 28

Abstract

As a new area of machine learning research, deep learning has attracted a lot of attention from the research community. It may bring human understanding of data to a higher cognitive level. Its unsupervised pre-training step finds high-dimensional representations or abstract features that work much better than those from principal component analysis (PCA). However, deep learning runs into problems on large-scale data because training proceeds through many levels, each computationally intensive. Sequential deep learning algorithms usually cannot finish the computation in an acceptable time. In this paper, we propose a parallel many-core algorithm for Intel Xeon Phi systems that speeds up the unsupervised training of the Sparse Autoencoder and the Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline, we adopted several optimization methods to parallelize it. The experimental results show that our fully optimized parallel Sparse Autoencoder achieves a speedup of more than 300-fold over the original sequential algorithm on the Intel Xeon Phi coprocessor. We also ran the fully optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU: for this application, our method on the Xeon Phi is 7 to 10 times faster than on the Xeon CPU. In addition, we compared our fully optimized code on the Intel Xeon Phi with Matlab code running on a single Intel Xeon CPU; our method on the Xeon Phi runs 16 times faster than the Matlab implementation. The results also suggest that, compared to a GPU, the Intel Xeon Phi offers an efficient but more general-purpose way to parallelize deep learning algorithms. It also achieves higher speed with better parallelism than the Intel Xeon CPU.
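The abstract does not spell out the parallel method, so the following is only a minimal sketch of one common way such training is decomposed, not the authors' implementation. It assumes the textbook sparse autoencoder objective (mean reconstruction error plus a KL-divergence sparsity penalty, with weight decay omitted for brevity) and splits the per-sample work into chunks whose partial sums are combined afterwards. The function names, the parameters `rho` and `beta`, and the use of Python threads are all assumptions made for illustration; Python threads show the decomposition only and will not reproduce the reported speedups.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return (hidden activations, reconstruction) for one sample."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(W2, b2)]
    return h, y

def chunk_stats(chunk, params):
    """Partial sums over one chunk: squared error and hidden activation totals."""
    W1, b1, W2, b2 = params
    err = 0.0
    act = [0.0] * len(W1)
    for x in chunk:
        h, y = forward(x, W1, b1, W2, b2)
        err += 0.5 * sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        act = [a + hi for a, hi in zip(act, h)]
    return err, act

def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparse_ae_cost(data, params, rho=0.05, beta=3.0, workers=1):
    """Sparse autoencoder cost, computed from per-chunk partial sums."""
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: chunk_stats(c, params), chunks))
    m = len(data)
    err = sum(p[0] for p in parts)                      # reduce squared error
    act = [sum(col) for col in zip(*(p[1] for p in parts))]  # reduce activations
    rho_hat = [a / m for a in act]                      # mean hidden activation
    return err / m + beta * sum(kl(rho, r) for r in rho_hat)

# Tiny hypothetical network: 3 inputs, 2 hidden units, 3 outputs.
params = (
    [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]], [0.0, 0.1],
    [[0.2, -0.3], [0.1, 0.1], [-0.2, 0.5]], [0.0, 0.0, 0.1],
)
data = [[0.1, 0.9, 0.5], [0.8, 0.2, 0.4], [0.3, 0.3, 0.7], [0.6, 0.1, 0.9]]
cost_sequential = sparse_ae_cost(data, params, workers=1)
cost_chunked = sparse_ae_cost(data, params, workers=2)
```

Because the error and activation totals are plain sums over samples, the chunked and sequential costs agree up to floating-point rounding. That independence between samples is what makes this kind of workload a natural fit for many-core hardware.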

