2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

PSU: A Framework for Dynamic Software Updates in Multi-threaded C-Language Programs
Marcus Karpoff, J. N. Amaral, Kai-Ting Amy Wang, Rayson Ho, B. Dobry
DOI: 10.1109/SBAC-PAD49847.2020.00040
Abstract: A Dynamic Software Update (DSU) system enables an operator to modify a running program without interrupting its execution. However, creating a DSU system that allows programs written in the C programming language to be modified while they are executing is challenging. This paper presents the Portable Software Update (PSU) system, a new framework for creating C-language DSU programs. PSU offers a simple programming interface to build DSU versions of existing C programs. Once a program is built with PSU, updates can be applied by background threads with negligible impact on the execution of the program. PSU supports multi-threaded and recursive programs without the use of safe points or thread blocking. PSU uses function indirection to redirect DSU function calls to the newest version of the function code. Once a DSU function is invoked in a PSU program, it executes to completion using the version of the function that was active when it was invoked; however, any future calls to the same function execute the newest installed version. This simple mechanism allows updates to be loaded quickly. PSU unloads obsolete versions of DSU functions once they are no longer executing, which makes PSU the first DSU system for C-language programs able to unload older versions of code. This efficient use of resources enables many patches to be applied to a long-running application. A suite of specialized synthetic programs and a DSU-enabled version of the MySQL database storage engine are used to evaluate the overhead of the DSU-enabling features. The MySQL storage engine maintains over 95% of the performance of the non-DSU version and allows the entire storage engine to be updated while the database continues executing. PSU includes a simple and straightforward process for modifying the storage engine to enable DSU.
Citations: 1
FFT Optimizations and Performance Assessment Targeted towards Satellite and Airborne Radar Processing
Maron Schlemon, J. Naghmouchi
{"title":"FFT Optimizations and Performance Assessment Targeted towards Satellite and Airborne Radar Processing","authors":"Maron Schlemon, J. Naghmouchi","doi":"10.1109/SBAC-PAD49847.2020.00050","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00050","url":null,"abstract":"Following the re-invention of the FFT algorithm by Cooley and Tukey in 1965, a lot of effort has been invested into optimization of this algorithm and all its variations. In this paper, we discuss its use and optimization for current and future radar applications, and give a brief survey on implementations that have claimed relatively high advantages in terms of performance over existing solutions. Correspondingly, we present an in-depth analysis of state-ofthe-art solutions and our own implementation that will allow the reader to evaluate the performance improvements on a fair basis. Therefore, we discuss the development of a highperformance Fast Fourier Transform (FFT) using an enhanced Radix-4 decimation in frequency (DIF) algorithm, compare it against the Fastest Fourier Transform in the West (FFTW) autotuned library as well as other solutions and frameworks.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126735839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Predicting the Energy Consumption of CUDA Kernels using SimGrid
Dorra Boughzala, L. Lefèvre, Anne-Cécile Orgerie
DOI: 10.1109/SBAC-PAD49847.2020.00035
Abstract: Building a sustainable exascale machine is a very promising target in High Performance Computing (HPC). To tackle the energy-consumption challenge while continuing to provide tremendous performance, the HPC community has rapidly adopted GPU-based systems. Today, GPUs are the most prevalent components in the massively parallel HPC landscape thanks to their high computational power and energy efficiency. Modeling the energy consumption of applications running on GPUs has received a great deal of attention in recent years. Alas, the HPC community lacks simple yet accurate simulators to predict the energy consumption of general-purpose GPU applications. In this work, we address the prediction of the energy consumption of CUDA kernels via simulation. We propose a simple and lightweight energy model, implemented using the open-source framework SimGrid. The proposed model is validated across a diverse set of CUDA kernels and on two different NVIDIA GPUs (Tesla M2075 and Kepler K20Xm). As our modeling approach is not based on performance counters or detailed architectural parameters, we believe it can be easily adopted by users who care about the energy consumption of their GPGPU applications.
Citations: 2
Exploiting Non-conventional DVFS on GPUs: Application to Deep Learning
Francisco Mendes, P. Tomás, N. Roma
DOI: 10.1109/SBAC-PAD49847.2020.00012
Abstract: The use of Graphics Processing Units (GPUs) to accelerate Deep Neural Network (DNN) training and inference is already widely adopted, allowing for a significant increase in the performance of these applications. However, this increase in performance comes at the cost of a corresponding increase in energy consumption. While several solutions have been proposed to perform voltage-frequency (V-F) scaling on GPUs, they remain one-dimensional, simply adjusting frequency while relying on default voltage settings. To overcome this, this paper introduces a methodology to fully characterize the impact of non-conventional Dynamic Voltage and Frequency Scaling (DVFS) on GPUs. The proposed approach was applied to an AMD Vega 10 Frontier Edition GPU. When applying this non-conventional DVFS scheme to DNNs, the obtained results show that it is possible to safely decrease the GPU voltage, allowing for a significant reduction of the energy consumption (up to 38%) and of the Energy-Delay Product (EDP) (up to 41%) when training CNN models, with no degradation of the networks' accuracy.
Citations: 1
On the Memory Underutilization: Exploring Disaggregated Memory on HPC Systems
I. Peng, R. Pearce, M. Gokhale
DOI: 10.1109/SBAC-PAD49847.2020.00034
Abstract: Large-scale high-performance computing (HPC) systems consist of massive compute and memory resources tightly coupled within nodes. We perform a large-scale study of memory utilization on four production HPC clusters. Our results show that more than 90% of jobs utilize less than 15% of the node memory capacity, and for 90% of the time memory utilization is below 35%. Recently, disaggregated architectures have gained traction because they can selectively scale up a resource and improve resource utilization. Based on these observations, we explore using disaggregated memory to support memory-intensive applications, while most jobs remain intact on HPC systems with reduced node memory. We designed and developed a user-space remote-memory paging library that enables applications to explore disaggregated memory on existing HPC clusters. We quantified the impact of access patterns and network connectivity in benchmarks. Our case studies of graph-processing and Monte Carlo applications evaluate the impact of application characteristics and local memory capacity, and highlight the potential of throughput scaling on disaggregated memory.
Citations: 28
Optically Connected Memory for Disaggregated Data Centers
Jorge González, A. Gazman, Maarten Hattink, Mauricio G. Palma, M. Bahadori, Ruth E. Rubio-Noriega, Lois Orosa, M. Glick, O. Mutlu, K. Bergman, R. Azevedo
DOI: 10.1109/SBAC-PAD49847.2020.00017
Abstract: Recent advances in integrated photonics enable the implementation of reconfigurable, high-bandwidth, and low energy-per-bit interconnects in next-generation data centers. We propose and evaluate an Optically Connected Memory (OCM) architecture that disaggregates main memory from the computation nodes in data centers. OCM is based on micro-ring resonators (MRRs) and requires no modification to the DRAM memory modules. We calculate energy consumption from real photonic devices and integrate them into a system simulator to evaluate performance. Our results show that (1) OCM is capable of interconnecting four DDR4 memory channels to a computing node using two fibers with 1.07 pJ energy-per-bit consumption, and (2) OCM performs up to 5.5x faster than a disaggregated memory using 40G PCIe NIC connectors to computing nodes.
Citations: 10
Scheduling Methods to Reduce Response Latency of Function as a Service
P. Żuk, K. Rządca
DOI: 10.1109/SBAC-PAD49847.2020.00028
Abstract: Function as a Service (FaaS) permits cloud customers to deploy individual functions to the cloud, in contrast to complete virtual machines or Linux containers. All major cloud providers offer FaaS products (Amazon Lambda, Google Cloud Functions, Azure Serverless); there are also popular open-source implementations (Apache OpenWhisk) with commercial offerings (Adobe I/O Runtime, IBM Cloud Functions). A new feature of FaaS is function composition: a function may (sequentially) call another function, which in turn may call yet another function, forming a chain of invocations. From the perspective of the infrastructure, a composed FaaS application is less opaque than a virtual machine or a container. We show that this additional information enables the infrastructure to reduce response latency. In particular, knowing the sequence of future invocations, the infrastructure can schedule these invocations along with environment preparation. We model resource management in FaaS as a scheduling problem combining (1) sequencing of invocations, (2) deploying execution environments on machines, and (3) allocating invocations to deployed environments. For each aspect we propose heuristics and explore their performance by simulation on a range of synthetic workloads. Our results show that if setup times are long compared to invocation times, algorithms that use information about the composition of functions consistently outperform greedy, myopic algorithms, leading to a significant decrease in response latency.
Citations: 8
sputniPIC: An Implicit Particle-in-Cell Code for Multi-GPU Systems
Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, I. Peng, Artur Podobas, S. Markidis
DOI: 10.1109/SBAC-PAD49847.2020.00030
Abstract: Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle-decomposition data layout, in contrast to the domain decomposition of CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speedups on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the CPU OpenMP version of sputniPIC. We show that reduced precision can further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously possible only on clusters.
Citations: 7
TASO: Time and Space Optimization for Memory-Constrained DNN Inference
Yuan Wen, Andrew Anderson, Valentin Radu, M. O’Boyle, David Gregg
DOI: 10.1109/SBAC-PAD49847.2020.00036
Abstract: Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on an integer linear programming (ILP) formulation for selecting primitive operations to implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of the memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNet, and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8x compared to greedy-algorithm-based primitive selection, and reduces the memory requirement by 2.2x while sacrificing only 15% of inference time compared to a solver that considers inference time alone. In addition, our optimization approach exposes a range of optimal points for different configurations across the Pareto frontier of the memory-latency trade-off, which can be used under arbitrary system constraints.
Citations: 7
AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing
V. E. Venugopal, M. Theobald, Samira Chaychi, Amal Tawakuli
DOI: 10.1109/SBAC-PAD49847.2020.00018
Abstract: Distributed Stream Processing Engines (DSPEs) are currently among the most active topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big-data analytics. In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI) and pthreads for multithreading, and is deployed directly on top of a common HPC workload manager such as SLURM. AIR implements a lightweight, dynamic sharding protocol (referred to as "Asynchronous Iterative Routing") that facilitates direct, asynchronous communication among all worker nodes and thereby completely avoids any additional communication overhead with a dedicated master node. With its unique design, AIR fills the gap between the prevalent scale-out (but Java-based) architectures like Apache Spark and Flink, on the one hand, and recent scale-up (C++-based) prototypes such as StreamBox and PiCo, on the other. Our experiments over various benchmark settings confirm that AIR performs as well as the best scale-up SPEs on a single-node setup, while it outperforms existing scale-out DSPEs in terms of processing latency and sustainable throughput by a factor of up to 15 in a distributed setting.
Citations: 7