Latest Publications from the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Improving Inference Latency and Energy of Network-on-Chip based Convolutional Neural Networks through Weights Compression
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00017
G. Ascia, V. Catania, John Jose, Salvatore Monteleone, M. Palesi, Davide Patti
Abstract: Network-on-Chip (NoC) based Convolutional Neural Network (CNN) accelerators are energy and performance limited by the communication traffic. In fact, to run an inference, the amount of traffic generated both on-chip and off-chip to fetch the parameters of the network, namely, filters and weights, accounts for a large fraction of the energy and latency. This paper presents a technique for compressing the network parameters in such a way as to reduce the amount of traffic for fetching them, thus improving the overall performance and energy figures of the accelerator. The lossy nature of the proposed compression technique results in a degradation of the accuracy of the network, which we show is, nevertheless, widely justified by the achievable latency and energy consumption improvements. The proposed technique is applied to several widespread CNN models, for which the trade-off between accuracy and inference latency and energy is discussed. We show that up to 63% inference latency reduction and 67% inference energy reduction can be achieved with less than 5% top-5 accuracy degradation, without the need to retrain the network.
Citations: 5
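A minimal illustrative sketch of one lossy weight-compression scheme (uniform 8-bit quantization). The abstract does not specify the authors' actual technique, so everything below is an assumption used only to show how parameter traffic can be traded against accuracy:

```python
import numpy as np

def compress(weights, bits=8):
    """Quantize float32 weights to signed `bits`-bit integers plus one scale (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / levels
    q = np.round(weights / scale).astype(np.int8)   # 4x fewer bytes to fetch than float32
    return q, scale

def decompress(q, scale):
    """Lossy reconstruction performed at the accelerator side."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # a hypothetical filter tensor
q, s = compress(w)
print("traffic reduction:", w.nbytes / q.nbytes)       # ~4x less data to move
print("max abs error:", np.abs(w - decompress(q, s)).max())
```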
Teaching Modern Multithreading in CS2 with Actors
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00061
Mark C. Lewis, Lisa L. Lacher
Abstract: Explosive growth in multiprocessor computing and the pervasive nature of multicore processors have not only made multithreading and related topics such as parallelism, concurrency, and synchronization an essential part of any undergraduate Computer Science curriculum, they have also led to the addition of newer constructs to support multithreading in many languages. Not only is it important to motivate student interest in this topic, it is also important that students are educated in current methods used in industry. This can mean an increase in material that needs to be covered. Because of the increase in scope of a CS education, teaching topics in parallel and distributed computing in a hands-on manner is challenging, so it is valuable for educators to explore different methods of educational delivery in order to best engage their students within the limits of curriculum timelines. The actor model is immensely popular in industry and runs some of the most important software today. In this paper, we describe how we are using Actors as a significant part of the multithreading coverage at the CS2 level, for first-year computer science majors. We also describe a semester-long project that involves the use of these concepts to help solidify student understanding, and we present student feedback on the project and approach.
Citations: 1
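For readers unfamiliar with the actor model mentioned above, here is a minimal sketch in Python using one mailbox thread per actor; it is not the library used in the course, only an illustration of why actors let students avoid explicit locks:

```python
import threading
import queue

class Actor:
    """Toy actor: a private mailbox drained by one worker thread, so the
    message handler touches the actor's state from a single thread only."""
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)            # asynchronous message passing

    def _run(self):
        while True:
            message = self.mailbox.get()
            self.receive(message)
            self.mailbox.task_done()

    def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    def __init__(self):
        super().__init__()
        self.count = 0                       # no lock needed: only the actor thread mutates it

    def receive(self, message):
        self.count += message

counter = Counter()
for _ in range(1000):
    counter.send(1)
counter.mailbox.join()                       # wait until all messages are processed
print(counter.count)                         # 1000
```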
Improving HLS Generated Accelerators Through Relaxed Memory Access Scheduling
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00020
Johanna Rohde, Karsten Müller, C. Hochberger
Abstract: High-Level Synthesis can be used to generate hardware accelerators for compute-intensive software parts (so-called kernels). For meaningful acceleration, such kernels should be able to autonomously access the memory. Unfortunately, such memory accesses can constitute dependences (e.g. writing an array before reading from it), leading to bottlenecks. The analysis of potential conflicts between memory accesses is often difficult and in many cases not even possible. In order to improve the scheduling of memory accesses, we propose a novel methodology to fully automatically place bypasses and squashes into the data flow graph that is used to generate the hardware accelerator. Evaluating our approach with the Powerstone benchmark suite, we show that execution time is reduced on average by 6.5%.
Citations: 1
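The bypass/squash idea can be hard to picture without hardware context. Below is a deliberately simplified software analogy, not the paper's methodology and with all names hypothetical: a load is issued speculatively before a possibly conflicting store, the addresses are checked at runtime, and on a conflict the speculative value is discarded (squashed) and the store's value is forwarded (bypass):

```python
def speculative_load(memory, store_addr, store_val, load_addr):
    """Toy analogy of relaxed memory-access scheduling: the load is scheduled
    before an earlier store whose address could not be disambiguated statically."""
    speculative_value = memory.get(load_addr)   # load issued early (speculation)
    memory[store_addr] = store_val              # the earlier store commits afterwards
    if load_addr == store_addr:                 # runtime dependence check
        # Conflict: squash the speculative value and bypass the store's value instead.
        return store_val
    return speculative_value                    # speculation was correct, keep the early value

mem = {0: 10, 1: 20}
assert speculative_load(mem, store_addr=0, store_val=99, load_addr=0) == 99  # bypass path
assert speculative_load(mem, store_addr=0, store_val=7, load_addr=1) == 20   # no conflict
```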
Improving MPI Application Communication Time with an Introspection Monitoring Library
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00124
E. Jeannot, Richard Sartori
Abstract: In this paper we describe how to improve the communication time of MPI parallel applications using a library that monitors MPI applications and allows for introspection (the program itself can query the state of the monitoring system). Based on previous work, this library is able to see how collective communications are decomposed into point-to-point messages. It also features monitoring sessions that allow suspending and restarting the monitoring, limiting it to specific portions of the code. Experiments show that the monitoring overhead is very small and that the proposed features allow for dynamic and efficient rank reordering, enabling up to a 2-fold reduction of the communication parts of some programs.
Citations: 5
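As a rough illustration of introspection-style monitoring (not the authors' library or its API), the mpi4py sketch below accumulates a per-destination traffic counter that the program itself can query and that could later feed a rank-reordering step; the session flag stands in for the suspend/resume feature, and unlike the real library this wrapper only sees explicit point-to-point sends, not decomposed collectives:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

bytes_to_peer = np.zeros(size)      # introspection data: traffic sent to each rank
monitoring_enabled = True           # stand-in for a monitoring "session" that can be suspended

def monitored_send(buf, dest, tag=0):
    """Wrapper around MPI send that records outgoing traffic (hypothetical helper)."""
    if monitoring_enabled:
        bytes_to_peer[dest] += buf.nbytes
    comm.Send(buf, dest=dest, tag=tag)

# ... application phases call monitored_send(...) instead of comm.Send(...) ...
# Gathering bytes_to_peer from all ranks yields a communication matrix that a
# rank-reordering step can use to place heavily communicating ranks close together.
```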
Message from the EduPar-20 Workshop Chairs
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/ipdpsw50202.2020.00053
S. Prasad, T. Newhall, David P. Bunde, Martina Barnas, S. Puri
Abstract: Welcome to the NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar-20) proceedings. The EduPar-20 workshop, held in conjunction with the IEEE International Parallel and Distributed Processing Symposium (IPDPS), is devoted to the development and assessment of educational and curricular innovations and resources for undergraduate and graduate education in Parallel and Distributed Computing (PDC). EduPar brings together individuals from academia, industry, and other educational and research institutes to explore new ideas, challenges, and experiences related to PDC pedagogy and curricula. The workshop is designed in coordination with the IEEE TCPP curriculum initiative on parallel and distributed computing (http://www.cs.gsu.edu/~tcpp/curriculum) for computer science and computer engineering undergraduates, and is supported by the NSF and the NSF-supported Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER).
Citations: 0
Two-Pass Softmax Algorithm
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00074
Marat Dukhan, Artsiom Ablavatski
Abstract: The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors the performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”. Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in an AVX512 implementation, and by up to 18% in an AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors.
Citations: 1
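A scalar Python sketch of the idea described above (the paper's implementations are vectorized AVX2/AVX512 kernels): each exp(x) is carried as a mantissa/exponent pair so that no separate max-finding pass is needed to avoid overflow:

```python
import math

def softmax_two_pass(xs):
    """Simplified two-pass softmax: exp(x) is represented as m * 2**e, so the
    normalizer can be accumulated without first scanning for the maximum input."""
    ln2 = math.log(2.0)
    acc, acc_e = 0.0, -(2 ** 30)          # running sum kept as acc * 2**acc_e
    # Pass 1: accumulate the normalization constant.
    for x in xs:
        e = int(round(x / ln2))           # "exponent" part
        m = math.exp(x - e * ln2)         # "mantissa" part, roughly in [0.71, 1.42]
        if e > acc_e:
            acc = math.ldexp(acc, acc_e - e) + m
            acc_e = e
        else:
            acc += math.ldexp(m, e - acc_e)
    # Pass 2: produce the normalized outputs.
    out = []
    for x in xs:
        e = int(round(x / ln2))
        m = math.exp(x - e * ln2)
        out.append(math.ldexp(m / acc, e - acc_e))
    return out

print(softmax_two_pass([1000.0, 1001.0, 1002.0]))  # no overflow despite large scores
```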
Porting a Legacy CUDA Stencil Code to oneAPI
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00070
Steffen Christgau, T. Steinke
Abstract: Recently, Intel released the oneAPI programming environment. With Data Parallel C++ (DPC++), oneAPI enables codes to target multiple hardware architectures like multi-core CPUs, GPUs, and even FPGAs or other hardware using a single source. For legacy codes that were written for Nvidia GPUs, a compatibility tool is provided which facilitates the transition to the SYCL-based DPC++ programming language. This paper presents early experiences with both the compatibility tool and oneAPI, as well as the extension to the SYCL programming standard employed, for the tsunami simulation code easyWave. A performance study compares the original code running on Xeon processors using OpenMP as well as CUDA with the performance of the DPC++ counterpart on multi-core CPUs and integrated GPUs.
Citations: 20
Analyzing Deep Learning Model Inferences for Image Classification using OpenVINO
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00152
Zheming Jin, H. Finkel
Abstract: It may be desirable to execute deep learning model inferences on an integrated GPU at the edge. While such GPUs are much less powerful than discrete GPUs, they are able to deliver more floating-point operations per second than a CPU located on the same die. For edge devices, the benefit of moving to lower precision with minimal loss of accuracy to obtain higher performance is also attractive. Hence, we chose 14 deep learning models for image classification and evaluated their inference performance with the OpenVINO toolkit. We then analyzed the implementation of the fastest of these models. The experimental results are promising. Compared to the performance of full-precision (FP32) models, the speedup of 8-bit (INT8) quantization ranges from 1.02 to 1.56 on an Intel® Xeon® 4-core CPU, and the speedup of the FP16 models ranges from 1.1 to 2 on an Intel® Iris™ Pro GPU. For the FP32 models, the GPU is on average 1.5X faster than the CPU.
Citations: 15
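As a hedged illustration of the kind of measurement harness involved (not the paper's code), the snippet below runs a single classification inference through the classic OpenVINO Python API; the model paths, the input name "data", and the input shape are placeholders, and the API surface has changed in newer OpenVINO releases:

```python
from openvino.inference_engine import IECore   # classic (2020-era) Python API
import numpy as np

ie = IECore()
# Placeholder paths to an Intermediate Representation produced by the Model Optimizer.
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="GPU")   # or "CPU" for the Xeon runs

image = np.random.rand(1, 3, 224, 224).astype(np.float32)    # dummy NCHW input
result = exec_net.infer(inputs={"data": image})              # "data" is a placeholder input name
probs = next(iter(result.values()))
print("predicted class:", int(np.argmax(probs)))
```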
Automatic Selection of Tuning Plugins in PTF Using Machine Learning
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00069
Robert Mijakovic, M. Gerndt
Abstract: Performance tuning of scientific codes often requires tuning many different aspects such as vectorization, OpenMP synchronization, MPI communication, and load balancing. The Periscope Tuning Framework (PTF), an online automatic tuning framework, relies on a flexible plugin mechanism providing tuning plugins for different tuning aspects. Individual plugins can be combined for convenience into meta-plugins. Since each plugin can take considerable execution time for testing various combinations of the tuning parameters, it is desirable to automatically predict the tuning potential of plugins for programs before their application. For this purpose, we developed a generic automatic prediction mechanism based on machine learning techniques. This paper demonstrates the technique in the context of the Compiler Flags Selection plugin, which tunes the parameters of a user-specified compiler for a given application.
Citations: 0
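A toy sketch of the general idea of predicting tuning potential from program features; the features, labels, and model below are invented for illustration and are not PTF's actual predictor:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-application features, e.g.
# [vectorizable-loop fraction, MPI-call density, memory-bound ratio].
X_train = np.array([
    [0.80, 0.10, 0.30],
    [0.10, 0.70, 0.60],
    [0.50, 0.20, 0.90],
    [0.05, 0.90, 0.40],
])
y_train = np.array([1, 0, 1, 0])   # 1 = the plugin showed tuning potential on this code

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

new_app = np.array([[0.70, 0.15, 0.50]])
if model.predict(new_app)[0] == 1:
    print("worth running the Compiler Flags Selection plugin on this application")
else:
    print("skip the plugin and save its tuning-search time")
```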
An Efficient Multicore CPU Implementation for Convolution-Pooling Computation in CNNs
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) Pub Date: 2020-05-01 DOI: 10.1109/IPDPSW50202.2020.00097
Hiroki Kataoka, Kohei Yamashita, Yasuaki Ito, K. Nakano, Akihiko Kasagi, T. Tabaru
Abstract: The main contribution of this paper is an efficient multicore CPU implementation of convolution-pooling computation in convolutional neural networks (CNNs). Since the convolution and pooling operations are performed several times in most CNNs, we propose a method to accelerate these operations. In our proposed multicore CPU implementation, we use convolution interchange to reduce the computational cost. Also, we implement the convolution-pooling computation efficiently using DNNL, an open-source library for accelerating deep learning frameworks. The experimental results on an Intel Core i9-7980XE CPU show that our proposed CPU implementation of convolution-pooling is 1.42 to 2.82 times faster than performing the convolutions followed by pooling with DNNL. Further, we incorporate the proposed implementation into TensorFlow so that it can be invoked as a TensorFlow operation. The incorporated implementation of convolution-pooling is 1.18 to 2.42 times faster than a straightforward implementation built from TensorFlow primitives.
Citations: 3
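One common form of the convolution interchange mentioned above can be checked numerically: a convolution followed by 2x2 average pooling with stride 2 equals a stride-1 2x2 average pooling followed by a stride-2 convolution, which cuts the multiply-accumulate count of the convolution roughly by 4. The naive NumPy sketch below only verifies that identity; the paper's DNNL-based multicore implementation is considerably more involved:

```python
import numpy as np

def conv2d_valid(x, w, stride=1):
    """Naive 'valid' 2-D cross-correlation."""
    kh, kw = w.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * w)
    return out

def avgpool2x2(x, stride):
    """2x2 average pooling with the given stride."""
    oh = (x.shape[0] - 2) // stride + 1
    ow = (x.shape[1] - 2) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+2, j*stride:j*stride+2].mean()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))

direct = avgpool2x2(conv2d_valid(x, w), stride=2)                  # conv, then pool
interchanged = conv2d_valid(avgpool2x2(x, stride=1), w, stride=2)  # pool, then strided conv
assert np.allclose(direct, interchanged)
```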