{"title":"Improving Inference Latency and Energy of Network-on-Chip based Convolutional Neural Networks through Weights Compression","authors":"G. Ascia, V. Catania, John Jose, Salvatore Monteleone, M. Palesi, Davide Patti","doi":"10.1109/IPDPSW50202.2020.00017","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00017","url":null,"abstract":"Network-on-Chip (NoC) based Convolutional Neural Network (CNN) accelerators are energy and performance limited by the communication traffic. In fact, to run an inference, the amount of traffic generated both on-chip and off-chip to fetch the parameters of the network, namely, filters and weights, accounts for a large fraction of the energy and latency. This paper presents a technique for compressing the network parameters in such a way to reduce the amount of traffic for fetching the network parameters thus improving the overall performance and energy figures of the accelerator. The lossy nature of the proposed compression technique results in a degradation of the accuracy of the network which we show being, nevertheless, widely justified by the achievable latency and energy consumption improvements. The proposed technique is applied to several widespread CNN models in which the trade-off accuracy vs. inference latency and inference energy is discussed. We show that up to 63% inference latency reduction and 67% inference energy reduction can be achieved with less than 5% top 5 accuracy degradation without the need of retraining the network.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125903255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Modern Multithreading in CS2 with Actors","authors":"Mark C. Lewis, Lisa L. Lacher","doi":"10.1109/IPDPSW50202.2020.00061","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00061","url":null,"abstract":"Explosive growth in multiprocessor computing and the pervasive nature of multicore processors has not only made multithreading and related topics such as parallelism, concurrency, synchronization, etc. an essential part of any undergraduate Computer Science curriculum, it has also lead to the addition of newer constructs to support multithreading in many languages. Not only is it important to motivate student interest in this topic, it is important that they are also educated in current methods used in industry. This can mean an increase in material that needs to be covered. Because of the increase in scope of a CS education, teaching topics in parallel and distributed computing in a hands-on manner is challenging, thus it is valuable for educators to explore different methods of educational delivery in order to best engage their students within the limits of curriculum timelines. The actor model is immensely popular in industry and runs some of the most important software today. In this paper, we describe how we are using Actors as a significant part of the multithreading coverage at the CS2 level, for first-year computer science majors. We also describe a semester-long project that involves the use of these concepts to help solidify student understanding and present student feedback on the project and approach.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126042530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving HLS Generated Accelerators Through Relaxed Memory Access Scheduling","authors":"Johanna Rohde, Karsten Müller, C. Hochberger","doi":"10.1109/IPDPSW50202.2020.00020","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00020","url":null,"abstract":"High-Level-Synthesis can be used to generate hardware accelerators for compute intense software parts (so called kernels). For meaningful acceleration, such kernels should be able to autonomously access the memory. Unfortunately, such memory accesses can constitute dependences (e.g. writing an array before reading from it) leading to bottlenecks. The analysis of potential conflicts of memory accesses is often difficult and in many cases not even possible. In order to improve the scheduling of memory accesses, we propose a novel methodology to fully automatically place bypasses and squashes into the data flow graph that is used to generate the hardware accelerator. Evaluating our approach with the Powerstone benchmark suite, we can show that execution time is reduced on average by 6.5%.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125659849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving MPI Application Communication Time with an Introspection Monitoring Library","authors":"E. Jeannot, Richard Sartori","doi":"10.1109/IPDPSW50202.2020.00124","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00124","url":null,"abstract":"In this paper we describe how to improve communication time of MPI parallel applications with the use of a library that enables to monitor MPI applications and allows for introspection (the program itself can query the state of the monitoring system). Based on previous work, this library is able to see how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code. Experiments show that the monitoring overhead is very small and that the proposed features allow for dynamic and efficient rank reordering enabling up to 2-time reduction of communication parts of some program.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125813539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the EduPar-20 Workshop Chairs","authors":"S. Prasad, T. Newhall, David P. Bunde, Martina Barnas, S. Puri","doi":"10.1109/ipdpsw50202.2020.00053","DOIUrl":"https://doi.org/10.1109/ipdpsw50202.2020.00053","url":null,"abstract":"Welcome to the NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar-20) proceedings. The EduPar-20 workshop, held in conjunction with the IEEE International Parallel and Computing Symposium (IPDPS), is devoted to the development and assessment of educational and curricular innovations and resources for undergraduate and graduate education in Parallel and Distributed Computing (PDC). EduPar brings together individuals from academia, industry, and other educational and research institutes to explore new ideas, challenges, and experiences related to PDC pedagogy and curricula. The workshop is designed in coordination with the IEEE TCPP curriculum initiative on parallel and distributed computing (http://www.cs.gsu.edu/~tcpp/curriculum) for computer science and computer engineering undergraduates, and is supported by the NSF and the NSF-supported Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER).","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125459875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Pass Softmax Algorithm","authors":"Marat Dukhan, Artsiom Ablavatski","doi":"10.1109/IPDPSW50202.2020.00074","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00074","url":null,"abstract":"The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors performance of all three passes is limited by memory bandwidth.We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”.Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in AVX512 implementation, and by up to 18% in AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 17","pages":"386-395"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141207283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Porting a Legacy CUDA Stencil Code to oneAPI","authors":"Steffen Christgau, T. Steinke","doi":"10.1109/IPDPSW50202.2020.00070","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00070","url":null,"abstract":"Recently, Intel released the oneAPI programming environment. With Data Parallel C++(DPC++), oneAPI enables codes to target multiple hardware architectures like multi-core CPUs, GPUs, and even FPGAs or other hardware using a single source. For legacy codes that were written for Nvidia GPUs, a compatibility tool is provided which facilitates the transition to the SYCL-based DPC++ programming language. This paper presents early experiences when using both the compatibility tool and oneAPI as well the employed extension to the SYCL programming standard for the tsunami simulation code easyWave. A performance study compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counter part on multicore CPUs as well as integrated GPUs.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132011267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Deep Learning Model Inferences for Image Classification using OpenVINO","authors":"Zheming Jin, H. Finkel","doi":"10.1109/IPDPSW50202.2020.00152","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00152","url":null,"abstract":"It may be desirable to execute deep learning model inferences on an integrated GPU at the edge. While such GPUs are much less powerful than discrete GPUs, it is able to deliver higher floating-point operations per second than a CPU located on the same die. For edge devices, the benefit of moving to lower precision with minimal loss of accuracy to obtain higher performance is also attractive. Hence, we chose 14 deep learning models for image classification to evaluate their inference performance with the OpenVINO toolkit. Then, we analyzed the implementation of the fastest inference model of all the models. The experimental results are promising. Compared to the performance of full-precision (FP32) models, the speedup of the 8-bit (INT8) quantization ranges from 1.02 to 1.56 on an Intel® Xeon® 4-core CPU, and the speedup of the FP16 models ranges from 1.1 to 2 on an Intel® IrisTM Pro GPU. For the FP32 models, the GPU is on average 1.5X faster than the CPU.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"89 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128004381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Selection of Tuning Plugins in PTF Using Machine Learning","authors":"Robert Mijakovic, M. Gerndt","doi":"10.1109/IPDPSW50202.2020.00069","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00069","url":null,"abstract":"Performance tuning of scientific codes often requires tuning many different aspects like vectorization, OpenMP synchronization, MPI communication, and load balancing. The Periscope Tuning Framework (PTF), an online automatic tuning framework, relies on a flexible plugin mechanism providing tuning plugins for different tuning aspects. Individual plugins can be combined for convenience into meta-plugins. Since each plugin can take considerable execution time for testing various combination of the tuning parameters, it is desirable to automatically predict the tuning potential of plugins for programs before their application. We developed a generic automatic prediction mechanism based on machine learning techniques for this purpose. This paper demonstrates this technique in the context of the Compiler Flags Selection plugin, that tunes the parameters of a user specified compiler for a given application.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"266 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134237982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Multicore CPU Implementation for Convolution-Pooling Computation in CNNs","authors":"Hiroki Kataoka, Kohei Yamashita, Yasuaki Ito, K. Nakano, Akihiko Kasagi, T. Tabaru","doi":"10.1109/IPDPSW50202.2020.00097","DOIUrl":"https://doi.org/10.1109/IPDPSW50202.2020.00097","url":null,"abstract":"The main contribution of this paper is to present an efficient multicore CPU implementation of convolution-pooling computation in convolutional neural networks (CNNs). Since the convolution and pooling operations are performed several times in most CNNs, we propose a method to accelerate the operations. In our proposed multicore CPU implementation, we use convolution interchange to reduce the computational cost. Also, we implement convolution-pooling computation efficiently using DNNL that is an open source library for accelerating deep learning frameworks. The experimental results using Intel Corei9-7980XE CPU show that our proposed CPU implementation for the convolution-pooling is 1.42 to 2.82 times faster than the multiple convolution and then pooling by DNNL. Further, we incorporate the proposed implementation into TensorFlow to perform them as a TensorFloW operation. The incorporated implementation for the convolution-pooling is 1.18 to 2.42 times faster than straightforward implementation by primitives in TensorFlow.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133306341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}