Title: Experimental Applications on SRAM-Based FPGA for the NanosatC-BR2 Scientific Mission
Authors: F. Benevenuti, E. Chielle, Jorge Tonfat, L. Tambara, F. Kastensmidt, Carlos A. Zaffari, João Baptista dos Santos Martins, O. Durão
DOI: https://doi.org/10.1109/IPDPSW.2019.00032
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: The use of reconfigurable devices such as FPGAs in nanosatellites allows in-flight prototyping and evaluation of different categories of designs of interest to aerospace technology. This includes blending experimental or well-proven legacy software executing on microprocessors with out-of-core accelerators and dedicated logic circuits, or even converting such software to logic circuits using high-level synthesis (HLS). An additional feature discussed in this work, relevant to the scientific mission of the NanosatC-BR2 nanosatellite, is the use of an SRAM-based FPGA as a radiation particle sensor, exploiting the susceptibility of SRAM memory to bit-flips caused by radiation. The process for recording bit-flips by bitstream readback is presented, as well as a set of experimental designs implemented on the FPGA for data processing. As the status of these experimental designs must be reliably tracked by a supervisory circuit implemented on the same SRAM-based FPGA, errors caused by the bit-flips must be considered. Mitigation using triple modular redundancy (TMR) is analyzed using fault injection, suggesting that a fine-grained distributed TMR approach can increase the mission time of the supervisory module by 8x at a target reliability of 95%, but with a 40% penalty in the estimated total power consumption of the FPGA. Conversely, a blockwise TMR approach can increase the mission time of the supervisory module by 6x at the same target reliability with no increase in the estimated total power consumption.
{"title":"Message from the Workshops Chair and Vice Chair","authors":"Cynthia A. Philips, S. Rajamanickam","doi":"10.1109/ipdpsw.2019.00006","DOIUrl":"https://doi.org/10.1109/ipdpsw.2019.00006","url":null,"abstract":"Welcome to IEEE IPDPS 2017 in Orlando and in particular to its workshops. Normally, IPDPS workshops are held on Monday preceding the main events, and on Friday following the main events. This year we have 23 workshops for these two days. We also have one workshop on Sunday. All these events offer 184 peer-reviewed papers, invited talks, posters and a great number of participants to interact. We are looking forward to the workshops.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126444933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Compression of Deep Neural Networks by Combining Pruning and Low Rank Decomposition
Authors: Saurabh Goyal, Anamitra R. Choudhury, Vivek Sharma
DOI: https://doi.org/10.1109/IPDPSW.2019.00162
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: The large number of weights in deep neural networks makes these models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as in "inferencing as a service" environments on the cloud. Prior work has considered reducing model size through compression techniques such as weight pruning and filter pruning, or through low-rank decomposition of the convolution layers. In this paper, we demonstrate the use of multiple techniques to achieve not only higher model compression but also a reduction in the compute resources required during inferencing. We perform filter pruning followed by low-rank Tucker decomposition for model compression. We show that our approach achieves up to 57% higher model compression than either Tucker decomposition or filter pruning alone at similar accuracy for GoogleNet. It also reduces FLOPs by up to 48%, thereby making inferencing faster.
{"title":"FPGA-Assisted Deterministic Routing for FPGAs","authors":"Dario Korolija, Mirjana Stojilović","doi":"10.1109/IPDPSW.2019.00034","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00034","url":null,"abstract":"FPGA routing is one of the most time-consuming steps of FPGA compilation, often preventing fast edit-compiletest cycles in prototyping and development. There have been attempts to accelerate FPGA routing using algorithmic improvements, multi-core or multi-CPU platforms. Instead, we propose porting FPGA routing to a CPU+FPGA platform. Motivated by the approaches used in FPGA-accelerated graph processing, we propose and implement three acceleration strategies: (1) reducing the number of expensive random memory accesses, (2) parallel and pipelined computation, and (3) efficient hardware priority queues. To test and evaluate the router performance, we implement it on DE1-SoC, a mid-end ARM+FPGA platform of Intel. Our router works and produces good quality results. Moreover, we succeed in accelerating the software router running on the embedded ARM. However, when compared to the latest VPR router running on a powerful Intel Core-i5 CPU, our CPU+FPGA router is slower. This is not unexpected, given the limited performance of the chosen hardware platform. Since this design can easily be ported to newer and higher-end CPU+FPGA systems, we estimate the performance it could achieve; the results indicate that a non-negligible speedup over the software-only router could indeed be obtained.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124660088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Towards Native Execution of Deep Learning on a Leadership-Class HPC System
Authors: Srikanth B. Yoginath, M. Alam, A. Ramanathan, D. Bhowmik, N. Laanait, K. Perumalla
DOI: https://doi.org/10.1109/IPDPSW.2019.00160
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Large parallel machines generally offer the best parallel performance with "native execution," achieved using codes developed with the optimized compilers, communication libraries, and runtimes offered on the machines. In this paper, we report and analyze performance results from native execution of deep learning on a leadership-class high-performance computing (HPC) system. Using our new code, DeepEx, we present a study of the parallel speedup and convergence rates of learning achieved with native parallel execution. In the trade-off between computational parallelism and synchronized convergence, we first focus on maximizing parallelism while still obtaining convergence. Scaling results are reported from execution on up to 15,000 GPUs using two scientific data sets from atom microscopy and protein folding applications, as well as the popular ImageNet data set. In terms of the traditional measure of parallel speedup, excellent scaling is observed up to 12,000 GPUs. Additionally, to account for the convergence rate of deep learning accuracy or error, a deep learning-specific metric called "learning speedup" is also tracked. The performance results indicate the need to evaluate parallel deep learning execution in terms of learning speedup, and point to additional directions for improved exploitation of high-end HPC systems.
{"title":"A Performance Analysis of Large Scale Scientific Computing Applications from Log Archives","authors":"Liqiang Cao, X. Liu, Xiaowen Xu, Zhanjun Liu","doi":"10.1109/IPDPSW.2019.00079","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00079","url":null,"abstract":"A log archive for scientific computing applications is a set of logs for model and time of jobs in HPCs. We have developed light weight and fast performance analysis tools on top of log archives. We classify the job logs based on the similarity of the input models to form a model-based tree like archive. With linear regression, we analyze the relations of the step time of the jobs with the parameters in the model. We found that although there is some disturbance, the performance of most of the jobs showed good regularity. In one of the applications, we found the step time of job changes proportionally to the geometric parameters of model. And the most significant physical parameter determines step time up to 1.7 times. In another application, we find that the performance of each step scales 1.59 times with the number of process scales from 384 to 768.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121792237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Toward an Analytical Performance Model to Select between GPU and CPU Execution
Authors: Artem Chikin, J. N. Amaral, Karim Ali, Ettore Tiotto
DOI: https://doi.org/10.1109/IPDPSW.2019.00068
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Automating device selection in heterogeneous computing platforms requires modelling performance both on CPUs and on accelerators. This work argues that a hybrid analytical performance modelling approach is a practical way to build fast and efficient methods for selecting an appropriate target for a given computation kernel. The target selection problem has been addressed in the literature, but with a strong emphasis on building empirical models with machine learning techniques; we argue that the applicability of such solutions is often limited in production systems. This paper focuses on building a selector that decides whether an OpenMP loop nest should be executed on a CPU or on a GPU. To this end, it offers a comprehensive comparative evaluation of the differences in GPU kernel performance across devices from multiple generations of architectures. The goal is to underscore the need for accurate analytical performance models and to provide insight into the evolution of GPU accelerators. This work also highlights a weakness of existing approaches to modelling GPU performance: accurately modelling memory-coalescing characteristics. To that end, we examine a novel application of an inter-thread difference analysis that can further improve analytical models. Finally, this work presents an initial study of an OpenMP runtime framework for target selection in target offloading.
{"title":"Efficient Conversion of Boolean Circuits to Nondeterministic Branching Programs","authors":"Y. Ben-Asher, V. Tartakovsky","doi":"10.1109/IPDPSW.2019.00111","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00111","url":null,"abstract":"Two models to realize boolean functions exist: Boolean circuits (BCs) a DAG of and/or/not-gates and Branching programs (BPs) a network of switching nodes wherein signals propagate through the switched nodes. Evaluation of BCs is inherently sequential (Based on the common belief that P neq NC) while BPs can be evaluated in parallel by verifying connectivity between the source and the sync nodes of an equivalent BP. This suggests a way to parallelize or evaluate in parallel inherently sequential computations (ISCs) by compiling them to BCs and then convert them to BPs. Our results suggest that BCs emanating from real computations can be converted to-BPs with no size blowup compare to the size of the original BC and in fact have a smaller size compared to the size of the original BCs.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115109780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: EduPar Posters
Authors: Deepak Aggarwal, Fei Cao, Harish Charan, D. Deb, Dabin Ding, Toby Dragon, M. Fuad, Prashant Kumar, Hemant Joshi, Anthony Moore, Justin Y. Shi, Mengxia Zhu, Martina Barnas, N. Rodriguez
DOI: https://doi.org/10.1109/IPDPSW.2019.00065
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: This paper provides an overview of the five posters accepted for the EduPar'19 poster session. The poster session has proved to be an important opportunity for interaction within the community, fostering the discussion of innovative approaches and ideas that are under development.
Title: SummaGen: Parallel Matrix-Matrix Multiplication Based on Non-rectangular Partitions for Heterogeneous HPC Platforms
Authors: Stephen Patton, Hamidreza Khaleghzadeh, Ravi Reddy, Alexey L. Lastovetsky
DOI: https://doi.org/10.1109/IPDPSW.2019.00017
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high performance computing (HPC) domain. The problem of finding the optimal shapes of matrix partitions for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second relaxed the rectangularity constraint to allow non-rectangular partitions. The research in the second thread, however, is entirely theoretical: there is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose SummaGen, an implementation of PMM based on non-rectangular partitions. To study its efficacy, we compare the performance of PMM for four partition shapes proven optimal for the three-processor case, where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performance (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We further demonstrate that the four shapes exhibit equal dynamic energy consumption in this case. We also present a study of the performance of PMM for the same partition shapes under a matrix decomposition produced by a load-imbalancing data partitioning algorithm employing functional performance models (FPMs). The peak and average performance of the implementation are 80% and 70% of the theoretical peak floating-point performance of the machine.