Title: Experimental Applications on SRAM-Based FPGA for the NanosatC-BR2 Scientific Mission
Authors: F. Benevenuti, E. Chielle, Jorge Tonfat, L. Tambara, F. Kastensmidt, Carlos A. Zaffari, João Baptista dos Santos Martins, O. Durão
DOI: https://doi.org/10.1109/IPDPSW.2019.00032
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: The use of reconfigurable devices such as FPGAs in nanosatellites allows in-flight prototyping and evaluation of different categories of designs of interest to aerospace technology. This includes blending experimental or well-proven legacy software executing on microprocessors with out-of-core accelerators and dedicated logic circuits, or even converting such software to logic circuits using high-level synthesis (HLS). An additional feature discussed in this work, relevant to the scientific mission of the NanosatC-BR2 nanosatellite, is the use of an SRAM-based FPGA as a radiation particle sensor, exploiting the susceptibility of SRAM memory to bit-flips caused by radiation. The process for recording bit-flips by bitstream readback is presented, as well as a set of experimental designs implemented on the FPGA for data processing. As the status of these experimental designs must be reliably tracked by a supervisory circuit implemented on the same SRAM-based FPGA, errors caused by the bit-flips must be considered. Mitigation using triple modular redundancy (TMR) is analyzed using fault injection, suggesting that a fine-grained distributed TMR approach can increase the mission time of the supervisory module by 8x at a target reliability of 95%, but with a 40% penalty in the estimated total power consumption of the FPGA. Conversely, a blockwise TMR approach can increase the mission time of the supervisory module by 6x at the same target reliability with no increase in the estimated total power consumption.
{"title":"Message from the Workshops Chair and Vice Chair","authors":"Cynthia A. Philips, S. Rajamanickam","doi":"10.1109/ipdpsw.2019.00006","DOIUrl":"https://doi.org/10.1109/ipdpsw.2019.00006","url":null,"abstract":"Welcome to IEEE IPDPS 2017 in Orlando and in particular to its workshops. Normally, IPDPS workshops are held on Monday preceding the main events, and on Friday following the main events. This year we have 23 workshops for these two days. We also have one workshop on Sunday. All these events offer 184 peer-reviewed papers, invited talks, posters and a great number of participants to interact. We are looking forward to the workshops.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126444933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Compression of Deep Neural Networks by Combining Pruning and Low Rank Decomposition
Authors: Saurabh Goyal, Anamitra R. Choudhury, Vivek Sharma
DOI: https://doi.org/10.1109/IPDPSW.2019.00162
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: The large number of weights in deep neural networks makes these models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as in "inferencing as a service" environments on the cloud. Prior work has considered reducing model size through compression techniques such as weight pruning and filter pruning, or through low-rank decomposition of the convolution layers. In this paper, we demonstrate the use of multiple techniques to achieve not only higher model compression but also a reduction in the compute resources required during inferencing. We perform filter pruning followed by low-rank Tucker decomposition for model compression. We show that our approach achieves up to 57% higher model compression than either Tucker decomposition or filter pruning alone at similar accuracy for GoogleNet. It also reduces FLOPs by up to 48%, thereby making inferencing faster.
{"title":"FPGA-Assisted Deterministic Routing for FPGAs","authors":"Dario Korolija, Mirjana Stojilović","doi":"10.1109/IPDPSW.2019.00034","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00034","url":null,"abstract":"FPGA routing is one of the most time-consuming steps of FPGA compilation, often preventing fast edit-compiletest cycles in prototyping and development. There have been attempts to accelerate FPGA routing using algorithmic improvements, multi-core or multi-CPU platforms. Instead, we propose porting FPGA routing to a CPU+FPGA platform. Motivated by the approaches used in FPGA-accelerated graph processing, we propose and implement three acceleration strategies: (1) reducing the number of expensive random memory accesses, (2) parallel and pipelined computation, and (3) efficient hardware priority queues. To test and evaluate the router performance, we implement it on DE1-SoC, a mid-end ARM+FPGA platform of Intel. Our router works and produces good quality results. Moreover, we succeed in accelerating the software router running on the embedded ARM. However, when compared to the latest VPR router running on a powerful Intel Core-i5 CPU, our CPU+FPGA router is slower. This is not unexpected, given the limited performance of the chosen hardware platform. Since this design can easily be ported to newer and higher-end CPU+FPGA systems, we estimate the performance it could achieve; the results indicate that a non-negligible speedup over the software-only router could indeed be obtained.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124660088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Towards Native Execution of Deep Learning on a Leadership-Class HPC System
Authors: Srikanth B. Yoginath, M. Alam, A. Ramanathan, D. Bhowmik, N. Laanait, K. Perumalla
DOI: https://doi.org/10.1109/IPDPSW.2019.00160
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Large parallel machines generally offer the best parallel performance with "native execution," achieved using codes developed with the optimized compilers, communication libraries, and runtimes offered on the machines. In this paper, we report and analyze performance results from native execution of deep learning on a leadership-class high-performance computing (HPC) system. Using our new code, DeepEx, we present a study of the parallel speedup and convergence rates of learning achieved with native parallel execution. In the trade-off between computational parallelism and synchronized convergence, we first focus on maximizing parallelism while still obtaining convergence. Scaling results are reported from execution on up to 15,000 GPUs using two scientific data sets from atom microscopy and protein folding applications, as well as the popular ImageNet data set. In terms of the traditional measure of parallel speedup, excellent scaling is observed up to 12,000 GPUs. Additionally, to account for the convergence rate of deep learning accuracy or error, a deep learning-specific metric called "learning speedup" is also tracked. The performance results indicate the need to evaluate parallel deep learning execution in terms of learning speedup, and point to additional directions for improved exploitation of high-end HPC systems.
{"title":"A Performance Analysis of Large Scale Scientific Computing Applications from Log Archives","authors":"Liqiang Cao, X. Liu, Xiaowen Xu, Zhanjun Liu","doi":"10.1109/IPDPSW.2019.00079","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00079","url":null,"abstract":"A log archive for scientific computing applications is a set of logs for model and time of jobs in HPCs. We have developed light weight and fast performance analysis tools on top of log archives. We classify the job logs based on the similarity of the input models to form a model-based tree like archive. With linear regression, we analyze the relations of the step time of the jobs with the parameters in the model. We found that although there is some disturbance, the performance of most of the jobs showed good regularity. In one of the applications, we found the step time of job changes proportionally to the geometric parameters of model. And the most significant physical parameter determines step time up to 1.7 times. In another application, we find that the performance of each step scales 1.59 times with the number of process scales from 384 to 768.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121792237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Toward an Analytical Performance Model to Select between GPU and CPU Execution
Authors: Artem Chikin, J. N. Amaral, Karim Ali, Ettore Tiotto
DOI: https://doi.org/10.1109/IPDPSW.2019.00068
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Automating device selection in heterogeneous computing platforms requires modelling performance both on CPUs and on accelerators. This work argues that a hybrid analytical performance modelling approach is a practical way to build fast and efficient methods for selecting an appropriate target for a given computation kernel. The target selection problem has been addressed in the literature, but with a strong emphasis on building empirical models with machine learning techniques; we argue that the applicability of such solutions is often limited in production systems. This paper focuses on building a selector that decides whether an OpenMP loop nest should be executed on a CPU or on a GPU. To this end, it offers a comprehensive comparative evaluation of the differences in GPU kernel performance across devices from multiple generations of architectures. The goal is to underscore the need for accurate analytical performance models and to provide insight into the evolution of GPU accelerators. This work also highlights a weakness of existing approaches to modelling GPU performance: accurately modelling memory-coalescing characteristics. To that end, we examine a novel application of an inter-thread difference analysis that can further improve analytical models. Finally, this work presents an initial study of an OpenMP runtime framework for target selection in target offloading.
{"title":"Efficient Conversion of Boolean Circuits to Nondeterministic Branching Programs","authors":"Y. Ben-Asher, V. Tartakovsky","doi":"10.1109/IPDPSW.2019.00111","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00111","url":null,"abstract":"Two models to realize boolean functions exist: Boolean circuits (BCs) a DAG of and/or/not-gates and Branching programs (BPs) a network of switching nodes wherein signals propagate through the switched nodes. Evaluation of BCs is inherently sequential (Based on the common belief that P neq NC) while BPs can be evaluated in parallel by verifying connectivity between the source and the sync nodes of an equivalent BP. This suggests a way to parallelize or evaluate in parallel inherently sequential computations (ISCs) by compiling them to BCs and then convert them to BPs. Our results suggest that BCs emanating from real computations can be converted to-BPs with no size blowup compare to the size of the original BC and in fact have a smaller size compared to the size of the original BCs.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115109780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: EduPar Posters
Authors: Deepak Aggarwal, Fei Cao, Harish Charan, D. Deb, Dabin Ding, Toby Dragon, M. Fuad, Prashant Kumar, Hemant Joshi, Anthony Moore, Justin Y. Shi, Mengxia Zhu, Martina Barnas, N. Rodriguez
DOI: https://doi.org/10.1109/IPDPSW.2019.00065
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: This paper provides an overview of the five posters accepted for the EduPar'19 poster session. The poster session has proved to be an important opportunity for interaction within the community, fostering the discussion of innovative approaches and ideas that are under development.
Title: SummaGen: Parallel Matrix-Matrix Multiplication Based on Non-rectangular Partitions for Heterogeneous HPC Platforms
Authors: Stephen Patton, Hamidreza Khaleghzadeh, Ravi Reddy, Alexey L. Lastovetsky
DOI: https://doi.org/10.1109/IPDPSW.2019.00017
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2019
Abstract: Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high performance computing (HPC) domain. The problem of finding the optimal shapes of matrix partitions for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second relaxed the rectangularity constraint to allow non-rectangular partitions. The research in the second thread, however, is entirely theoretical: there is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose SummaGen, an implementation of PMM based on non-rectangular partitions. To study its efficacy, we compare the performance of PMM for four partition shapes proven optimal for the three-processor case, where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performance (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We further demonstrate that the four shapes exhibit equal dynamic energy consumption in this case. We also present a study of the performance of PMM for the same partition shapes under a matrix decomposition produced by a load-imbalancing data partitioning algorithm employing functional performance models (FPMs). The peak and average performance of the implementation are 80% and 70% of the theoretical peak floating-point performance of the machine.