2020 IEEE International Symposium on Workload Characterization (IISWC): Latest Papers

Organizing Committee: IISWC 2020

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00008

Citations: 0

MATCH: An MPI Fault Tolerance Benchmark Suite

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00015

Luanzheng Guo, G. Georgakoudis, K. Parasyris, I. Laguna, Dong Li

Abstract: MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.

Citations: 4

Empirical Analysis and Modeling of Compute Times of CNN Operations on AWS Cloud

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00026

Ubaid Ullah Hafeez, Anshul Gandhi

Abstract: Given the widespread use of Convolutional Neural Networks (CNNs) in image classification applications, cloud providers now routinely offer several GPU-equipped instances with varying price points and hardware specifications. From a practitioner's perspective, given an arbitrary CNN, it is not obvious which GPU instance should be employed to minimize the model training time and/or rental cost. This paper presents Ceer, a model-driven approach to determine the optimal GPU instance(s) for any given CNN. Based on an operation-level empirical analysis of various CNNs, we develop regression models for heavy GPU operations (where input size is a key feature) and employ the sample median estimator for light GPU and CPU operations. To estimate the communication overhead between CPU and GPU(s), especially in the case of multi-GPU training, we develop a model that relates this communication overhead to the number of model parameters in the CNN. Evaluation results on AWS Cloud show that Ceer can accurately predict training time and cost (less than 5% average prediction error) across CNNs, enabling 36% to 44% cost savings over simpler strategies that employ the cheapest or the latest generation GPU instances.
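The modeling recipe in the abstract above (a regression on input size for heavy operations, a sample median for light operations, and a communication term tied to the parameter count) can be illustrated with a minimal Python sketch. This is not Ceer's implementation: the linear form of the regression, the proportional communication term, the function names, and every number below are hypothetical and chosen only to show how the three pieces combine.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical profiling samples (illustrative values, not measurements).
heavy_samples = [(1.0, 3.0), (2.0, 5.0), (4.0, 9.0), (8.0, 17.0)]  # (input size, ms)
light_samples = [0.05, 0.04, 0.06, 0.05, 0.30]  # ms, with one outlier

# Heavy operations: regression on input size (assumed linear here).
slope, intercept = fit_line(*zip(*heavy_samples))

# Light operations: the sample median is robust to the occasional outlier.
light_ms = median(light_samples)


def estimate_iteration_ms(heavy_input_sizes, n_light_ops, n_params, comm_ms_per_param):
    """Sum the three modeled components of one training iteration."""
    heavy = sum(slope * size + intercept for size in heavy_input_sizes)
    light = n_light_ops * light_ms
    comm = n_params * comm_ms_per_param  # assumed proportional to parameter count
    return heavy + light + comm
```

With such an estimate per instance type, one could multiply the predicted training time by each instance's hourly price and pick the cheapest option, which is the kind of comparison the abstract describes.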

Citations: 6

Reconfigurable Accelerator Compute Hierarchy: A Case Study using Content-Based Image Retrieval

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00034

Nazanin Farahpour, Y. Hao, Zhenman Fang, Glenn D. Reinman

Abstract: The recent adoption of reconfigurable hardware accelerators in data centers has significantly improved their computational power and energy efficiency for compute-intensive applications. However, for common communication-bound analytics workloads, these benefits are limited by the efficiency of data movement in the IO stack. For this reason, server architects are proposing a more data-centric acceleration scheme by moving the compute elements closer to the data. While prior studies focus on the benefits of Near Data Processing (NDP) solely on one level of the memory hierarchy (one of cache, main memory or storage), we focus on the collaboration of NDP accelerators at all levels and their collective benefits in accelerating an application pipeline. In this paper, we present a Reconfigurable Accelerator Compute Hierarchy (ReACH) that combines on-chip, near-memory, and near-storage accelerators. Each memory level has a reconfigurable accelerator chip attached to it, which provides distinct compute and memory capabilities and offers a broad spectrum of acceleration options. To enable effective acceleration on various application pipelines, we propose a holistic approach to coordinate between the compute levels, reducing inter-level data access interference and achieving asynchronous task flow control. To minimize the programming effort of using the compute hierarchy, a uniform programming interface is designed to decouple the ReACH configuration from the user application source code and allow runtime adjustments without modifying the deployed application. We experimentally deploy a billion-scale Content-Based Image Retrieval (CBIR) system on ReACH. Simulation results demonstrate that a proper application mapping eliminates unnecessary data movement, and ReACH achieves 4.5x throughput gain while reducing energy consumption by 52% compared to conventional on-chip acceleration.

Citations: 0

Reliability Modeling of NISQ-Era Quantum Computers

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00018

Ji Liu, Huiyang Zhou

Abstract: Recent developments in quantum computers have been pushing up the number of qubits. However, the state-of-the-art Noisy Intermediate Scale Quantum (NISQ) computers still do not have enough qubits to accommodate the error correction circuit. Noise in quantum gates limits the reliability of quantum circuits. To characterize the noise effects, prior methods such as process tomography, gate set tomography and randomized benchmarking have been proposed. However, the challenge is that these methods do not scale well with the number of qubits. Noise models based on an understanding of the underlying physics have also been proposed to study different kinds of noise in quantum computers. The difficulty is that there is no widely accepted noise model that incorporates all the different kinds of errors. Real-world errors can be very complicated, and producing accurate noise models remains an active area of research. In this paper, instead of using noise models to estimate the reliability, which is measured with success rates or inference strength, we treat the NISQ quantum computer as a black box. We use several quantum circuit characteristics such as the number of qubits, circuit depth, the number of CNOT gates, and the connection topology of the quantum computer as inputs to the black box and derive a reliability estimation model using (1) polynomial fitting and (2) a shallow neural network. We propose randomized benchmarks with random numbers of qubits and basic gates to generate a large data set for neural network training. We show that the estimated reliability from our black-box model outperforms the noise models from Qiskit. We also showcase that our black-box model can be used to guide quantum circuit optimization at compile time.

Citations: 21

Vertex Reordering for Real-World Graphs and Applications: An Empirical Evaluation

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00031

Reet Barik, Marco Minutoli, M. Halappanavar, Nathan R. Tallent, A. Kalyanaraman

Abstract: Vertex reordering is a way to improve locality in graph computations. Given an input (or "natural") order, reordering aims to compute an alternate permutation of the vertices that maximizes a locality-based objective. Given decades of research on this topic, there are tens of graph reordering schemes, and there are also several linear arrangement "gap" measures that can be treated as objectives. However, a comprehensive empirical analysis of the efficacy of the ordering schemes against the different gap measures, and against real-world applications, is currently lacking. In this study, we present an extensive empirical evaluation of up to 11 ordering schemes, taken from different classes of approaches, on a set of 34 real-world graphs emerging from different application domains. Our study is presented in two parts: a) a thorough comparative evaluation of the different ordering schemes on their effectiveness in optimizing different linear arrangement gap measures, relevant to preserving locality; and b) an extensive evaluation of the impact of the ordering schemes on two real-world, parallel graph applications, namely, community detection and influence maximization. Our studies show a significant divergence among the ordering schemes (up to 40x between the best and the poorest) in their effectiveness in reducing the gap measures, and a wide-ranging impact of the ordering schemes on various aspects including application runtime (up to 4x), memory and cache use, load balancing, and parallel work and efficiency. The comparative study also helps reveal the nuances of a parallel environment (compared to serial) on the ordering schemes and their role in optimizing applications.
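To make the idea of a linear arrangement "gap" measure in the abstract above concrete, here is a small Python sketch of one common form: the average distance, in the vertex ordering, between the two endpoints of each edge. Whether this exact measure is among those evaluated in the paper is an assumption; the toy graph and the function name `average_gap` are hypothetical.

```python
def average_gap(edges, order):
    """Mean |pos(u) - pos(v)| over all edges, where pos is a vertex's index in `order`."""
    position = {vertex: index for index, vertex in enumerate(order)}
    return sum(abs(position[u] - position[v]) for u, v in edges) / len(edges)


# Toy path graph 0-3-1-4-2 whose natural labels scatter neighbours apart.
edges = [(0, 3), (3, 1), (1, 4), (4, 2)]

natural_order = [0, 1, 2, 3, 4]  # the input ("natural") order
reordered = [0, 3, 1, 4, 2]      # follows the path, so neighbours sit side by side

natural_gap = average_gap(edges, natural_order)
reordered_gap = average_gap(edges, reordered)

print(natural_gap)    # 2.5
print(reordered_gap)  # 1.0
```

A smaller gap means neighbouring vertices are stored closer together, which is the locality property a reordering scheme tries to improve.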

Citations: 13

Keynote #1, Keynote #2 IISWC 2020

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00037

Citations: 0

CPI for Runtime Performance Measurement: The Good, the Bad, and the Ugly

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00019

Li Yi, Cong Li, Jianmei Guo

Abstract: Originally used for micro-architectural performance characterization, the metric of cycles per instruction (CPI) is now emerging as a proxy for workload performance measurement in runtime cloud environments. It has been used to evaluate per-workload performance before and after applying a system configuration change and to detect contention on micro-architectural resources in workload colocation. In this paper, we re-examine the use of CPI on two representative cloud computing workloads. An alternative metric, reference cycles per instruction (RCPI), is defined for comparison. We show that CPI is more sensitive than RCPI in identifying micro-architectural performance change in some cases. However, in other cases with a different frequency scaling, we observe a better CPI value despite worse performance. We conjecture that both observations are due to the bias of CPI towards scenarios with a low core frequency. We next demonstrate that a significant change in either CPI or RCPI does not necessarily indicate a boost or loss in performance, since both CPI and RCPI depend on workload intensity. This implies that using CPI without referring to the workload intensity is probably inappropriate. This prompts a discussion of the right way to use CPI, e.g., modeling CPI as a dependent variable given other relevant factors as the independent variables.
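The observation above that CPI can look better while performance gets worse can be reproduced with a toy model in Python. The model is an assumption made for illustration and is not taken from the paper: each instruction costs a fixed number of core cycles of compute plus a memory stall that is fixed in nanoseconds, and reference cycles are assumed to tick at a constant rate regardless of frequency scaling. The function name `metrics` and all parameter values are hypothetical.

```python
def metrics(core_ghz, ref_ghz=2.5, compute_cycles=0.5, memory_stall_ns=0.5):
    """Return (ns per instruction, CPI, RCPI) for a toy memory-bound workload."""
    # Compute time shrinks with a faster core clock; the memory stall does not.
    ns_per_instruction = compute_cycles / core_ghz + memory_stall_ns
    cpi = ns_per_instruction * core_ghz  # cycles counted at the core clock
    rcpi = ns_per_instruction * ref_ghz  # cycles counted at the fixed reference clock
    return ns_per_instruction, cpi, rcpi


fast = metrics(core_ghz=3.0)
slow = metrics(core_ghz=1.5)

for label, (ns, cpi, rcpi) in (("3.0 GHz", fast), ("1.5 GHz", slow)):
    print(f"{label}: {ns:.3f} ns/instr, CPI={cpi:.3f}, RCPI={rcpi:.3f}")
```

In this model, halving the core frequency raises the time per instruction from about 0.67 ns to about 0.83 ns, yet CPI drops from about 2.0 to about 1.25 because the fixed memory stall spans fewer of the slower core cycles. RCPI rises along with the actual time, which is consistent with the bias towards low core frequency that the abstract conjectures.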

Citations: 4

Program Committee: IISWC 2020

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/iiswc50251.2020.00007

Citations: 0

High Frequency Performance Monitoring via Architectural Event Measurement

2020 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2020-10-01 DOI: 10.1109/IISWC50251.2020.00020

Chutitep Woralert, James Bruska, Chen Liu, Lok K. Yan

Abstract: Obtaining detailed software execution information via hardware performance counters is a powerful analysis technique. The performance counters provide an effective method to monitor program behaviors; hence performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated and improved on. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs have been able to provide performance counter monitoring, but with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this paper, we present K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that can produce precise, non-intrusive, low overhead, periodic performance counter data using a kernel module based design. Our proposed approach has been evaluated on three different case studies to demonstrate its effectiveness, correctness and efficiency. By moving the responsibility of timing to kernel space, K-LEB can gather periodic data at a 100μs rate, which is 100 times faster than other comparable performance counter monitoring approaches. At the same time, it reduces the monitoring overhead by at least 58.8%, and the difference between the recorded performance counter readings and those of other tools is less than 0.3%.

Citations: 4