Exploiting Flow Graph of System of ODEs to Accelerate the Simulation of Biologically-Detailed Neural Networks
Bruno R. C. Magalhães, T. Sterling, F. Schürmann, M. Hines
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 20 May 2019. DOI: 10.1109/IPDPS.2019.00028

Exposing parallelism in scientific applications has become a core requirement for running efficiently on modern distributed multicore SIMD compute architectures. The granularity of parallelism that can be attained is a key determinant of the achievable acceleration and time to solution. Motivated by a scientific use case that requires the simulation of long spans of time, namely the study of plasticity and learning in detailed models of brain tissue, we present a strategy that exposes and exploits multicore and SIMD micro-parallelism by unrolling flow dependencies and concurrent outputs in a large system of coupled ordinary differential equations (ODEs). We present an implementation of a parallel simulator running on the HPX runtime system for the ParalleX execution model, which provides dynamic task scheduling and asynchronous execution. The implementation was tested on different architectures using a previously published brain tissue model. Benchmarks of single neurons on a single compute node show a speed-up of circa 4-7x compared with the state-of-the-art Single Instruction Multiple Data (SIMD) implementation, and of 13-40x over its Single Instruction Single Data (SISD) counterpart. Large-scale benchmarks suggest almost ideal strong scaling and a speed-up of 2-8x on a distributed architecture of 128 Cray X6 compute nodes.
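The core idea, deriving extra parallelism from the flow graph of a coupled ODE system, can be illustrated with a small sketch. The example below is hypothetical and not the authors' implementation: it layers a toy dependency graph so that nodes with no flow dependency between them land in the same layer, then evaluates each layer's contributions as concurrent tasks. Here std::async stands in for HPX's asynchronous tasking; the names OdeNode, layer_flow_graph, and step are invented for illustration, and the sketch omits the SIMD vectorization and the detailed neuron models used in the actual simulator.

```cpp
// Minimal sketch (assumptions noted above): expose task-level parallelism
// from the flow graph of a coupled ODE system. std::async is a stand-in for
// an asynchronous tasking runtime such as HPX.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

using Vec = std::vector<double>;

struct OdeNode {
    std::vector<int> deps;                             // nodes whose outputs this node consumes
    std::function<double(const Vec&, const Vec&)> rhs; // rhs(state y, outputs of earlier layers)
};

// Layer the flow graph: layer k holds nodes whose dependencies all lie in
// layers < k (longest-path layering; assumes each node's deps have lower indices).
static std::vector<std::vector<int>> layer_flow_graph(const std::vector<OdeNode>& sys) {
    std::vector<int> level(sys.size(), 0);
    for (std::size_t i = 0; i < sys.size(); ++i)
        for (int d : sys[i].deps)
            level[i] = std::max(level[i], level[d] + 1);
    const int depth = sys.empty() ? 0 : *std::max_element(level.begin(), level.end()) + 1;
    std::vector<std::vector<int>> layers(depth);
    for (std::size_t i = 0; i < sys.size(); ++i)
        layers[level[i]].push_back(static_cast<int>(i));
    return layers;
}

// One explicit integration step. Layers are processed in order; within a layer
// no node reads another's output, so the evaluations run as concurrent tasks.
static void step(const std::vector<OdeNode>& sys,
                 const std::vector<std::vector<int>>& layers,
                 Vec& y, double dt) {
    Vec out(y.size(), 0.0);                     // per-node outputs (dy/dt contributions)
    for (const auto& layer : layers) {
        std::vector<std::future<void>> tasks;
        for (int i : layer)
            tasks.push_back(std::async(std::launch::async, [&, i] {
                out[i] = sys[i].rhs(y, out);    // reads only outputs of earlier layers
            }));
        for (auto& t : tasks) t.get();          // barrier before the next layer
    }
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += dt * out[i];
}

int main() {
    // Toy 4-state system whose flow graph is 0 -> {1, 2} -> 3:
    // nodes 1 and 2 share a layer and are evaluated concurrently.
    std::vector<OdeNode> sys = {
        {{},     [](const Vec& y, const Vec&)     { return -y[0]; }},
        {{0},    [](const Vec& y, const Vec& out) { return out[0] - y[1]; }},
        {{0},    [](const Vec& y, const Vec& out) { return out[0] - y[2]; }},
        {{1, 2}, [](const Vec& y, const Vec& out) { return 0.5 * (out[1] + out[2]) - y[3]; }},
    };
    auto layers = layer_flow_graph(sys);
    Vec y = {1.0, 0.0, 0.0, 0.0};
    for (int s = 0; s < 1000; ++s) step(sys, layers, y, 1e-3);
    std::printf("y = [%.4f, %.4f, %.4f, %.4f]\n", y[0], y[1], y[2], y[3]);
}
```

In this toy layering, the speed-up comes only from evaluating independent nodes of a layer concurrently; the paper additionally reports SIMD gains from grouping identical computations across instances, which a sketch at this granularity does not capture.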