{"title":"Scalable Hardware Accelerator for Mini-Batch Gradient Descent","authors":"Sandeep Rasoori, V. Akella","doi":"10.1145/3194554.3194559","DOIUrl":null,"url":null,"abstract":"Iterative first-order methods that use gradient information form the core computation kernels of modern statistical data analytic engines, such as MADLib, Impala, Google Brain, GraphLab, MLlib in Spark, among others. Even the most advanced parallel stochastic gradient descent algorithm, such as Hogwild is not very scalable on conventional chip multiprocessors because of the bottlenecks induced by the memory system when sharing large model vectors. We propose a scalable architecture for large scale parallel gradient descent on a Field Programmable Gate Array (FPGA) by taking advantage of the large amount of embedded memory in modern FPGAs. We propose a novel data layout mechanism that eliminates the need for expensive synchronization and locking of shared data, which makes the architecture scalable. A 32-PE system on the Stratix V FPGA shows about 5x improvement in performance compared to state-of-the-art implementation on a 14 core/28 thread Intel Xeon CPU with 64 GB memory and operating at 2.6 GHz.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"2015 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3194554.3194559","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Iterative first-order methods that use gradient information form the core computation kernels of modern statistical data analytics engines such as MADLib, Impala, Google Brain, GraphLab, and MLlib in Spark. Even the most advanced parallel stochastic gradient descent algorithms, such as Hogwild!, do not scale well on conventional chip multiprocessors because of the memory-system bottlenecks that arise when large model vectors are shared across cores. We propose a scalable architecture for large-scale parallel gradient descent on a Field Programmable Gate Array (FPGA) that exploits the large amount of embedded memory in modern FPGAs. We also propose a novel data layout mechanism that eliminates the need for expensive synchronization and locking of shared data, which makes the architecture scalable. A 32-PE system on a Stratix V FPGA achieves roughly 5x better performance than a state-of-the-art implementation on a 14-core/28-thread Intel Xeon CPU with 64 GB of memory running at 2.6 GHz.
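For reference, the sketch below shows the generic mini-batch gradient descent kernel the abstract builds on, applied to a least-squares objective. This is only an illustration of the baseline algorithm, not the authors' FPGA architecture or data layout; the model, loss, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.01, epochs=10, seed=0):
    """Fit w to minimize ||Xw - y||^2 with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error over the mini-batch.
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
        # In a parallel setting (e.g., Hogwild!-style updates), many workers
        # apply such updates to a shared w concurrently; contention on that
        # shared model vector is the scalability bottleneck the abstract
        # describes.
    return w

# Usage: recover a known weight vector from noisy observations.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 8))
w_true = rng.standard_normal(8)
y = X @ w_true + 0.01 * rng.standard_normal(1000)
w_hat = minibatch_sgd(X, y)
```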