Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms

2011 IEEE International Parallel & Distributed Processing Symposium. Pub Date: 2011-05-16. DOI: 10.1109/IPDPS.2011.88

Andrew Nere, Atif Hashmi, Mikko H. Lipasti

Citations: 33

Abstract

Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive processing parallelism, making today's GPGPUs a highly attractive and readily available hardware accelerator for such a model. Furthermore, we consider two inefficiencies inherent to our initial design: the overhead of multiple kernel launches and poor utilization of GPGPU resources. To mitigate these problems, we propose optimizations such as a software work-queue structure and pipelining of the hierarchical layers of the cortical network. Our analysis provides insight into GPU architecture details, including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm that proportionally distributes the cortical network across the host CPU and any GPUs, whether homogeneous or heterogeneous, that are available to the system. Using the profiling tool with these optimizations on Nvidia's CUDA framework, we achieve up to 60x speedup over a single-threaded CPU implementation of the model.
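The abstract does not say how the profiling tool computes its proportional split, so the following is only a minimal sketch of one way such a distribution could work, not the authors' implementation. It assumes a hypothetical function `partition_by_throughput` that receives a count of network units and a per-device throughput measured by some earlier profiling run; the device names and throughput figures are made up for illustration.

```python
def partition_by_throughput(num_units, throughputs):
    """Split num_units work items across devices in proportion to throughput.

    throughputs maps a device name to a measured rate (hypothetical values,
    e.g. units processed per second during a short profiling run).
    The largest-remainder method keeps every share an integer while
    guaranteeing that the shares add up to num_units.
    """
    if num_units < 0:
        raise ValueError("num_units must be non-negative")
    if any(rate < 0 for rate in throughputs.values()):
        raise ValueError("throughputs must be non-negative")
    total = sum(throughputs.values())
    if total <= 0:
        raise ValueError("at least one device needs a positive throughput")

    # Ideal (fractional) share for each device.
    exact = {dev: num_units * rate / total for dev, rate in throughputs.items()}
    # Round down first, then hand out what is left over.
    shares = {dev: int(value) for dev, value in exact.items()}
    leftover = num_units - sum(shares.values())

    # Devices with the largest fractional part receive the remaining units.
    by_remainder = sorted(exact, key=lambda dev: exact[dev] - shares[dev],
                          reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


if __name__ == "__main__":
    # Made-up profile: one host CPU and two unequal GPUs.
    profile = {"cpu": 1.0, "gpu0": 6.0, "gpu1": 3.0}
    print(partition_by_throughput(1000, profile))
```

With the made-up profile above, the faster GPU receives six times the CPU's share and twice that of the slower GPU. A real tool would also have to account for per-device memory limits and transfer costs, which this sketch ignores.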
