Title: Optimizing Hyperplane Sweep Operations Using Asynchronous Multi-grain GPU Tasks
Authors: A. Kaushik, Ashwin M. Aji, M. A. Hassaan, N. Chalmers, Noah Wolfe, Scott Moe, Sooraj Puthoor, Bradford M. Beckmann
DOI: 10.1109/IISWC47752.2019.9042134
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: General-Purpose Graphics Processing Units (GPGPUs) are employed in today's fastest supercomputers to accelerate a variety of scientific compute workloads. These workloads typically comprise data-parallel mathematical kernels that are well suited for execution on GPUs. The hyperplane sweep operation is one such kernel that is commonly used in high-performance computing. In this paper, we characterize the conventional bulk-synchronous hyperplane sweep implementation currently used by GPUs and identify significant performance improvement potential by breaking the operation into finer-grain tasks. Guided by this characterization, we propose multi-grain task decomposition and scheduling techniques to optimize the operation. We use KRIPKE as a case study that features the sweep operation, and we show that our proposed optimizations achieve a 41% speedup over the bulk-synchronous implementation.
Title: An Overflow-free Quantized Memory Hierarchy in General-purpose Processors
Authors: Marzieh Lenjani, Patricia González, Elaheh Sadredini, M Arif Rahman, M. Stan
DOI: 10.1109/IISWC47752.2019.9042035
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: Data movement comprises a significant portion of energy consumption and execution time in modern applications. Accelerator designers exploit quantization to reduce the bitwidth of values and thereby the cost of data movement. However, any value that does not fit in the reduced bitwidth results in an overflow (we refer to these values as outliers). Therefore, accelerators use quantization only for applications that are tolerant of overflows. We observe that in most applications the rate of outliers is low and values often fall within a narrow range, providing the opportunity to exploit quantization in general-purpose processors. However, a software implementation of quantization in general-purpose processors has three problems. First, the programmer has to manually implement conversions and the additional instructions that quantize and dequantize values, imposing programmer effort and performance overhead. Second, to cover outliers, the bitwidth of the quantized values often becomes greater than or equal to that of the original values. Third, the programmer has to use standard bitwidths; otherwise, extracting non-standard bitwidths (i.e., 1–7, 9–15, and 17–31 bits) to represent narrow integers exacerbates the overhead of software-based quantization. The key idea of this paper is hardware support for quantization in the memory hierarchy of general-purpose processors, which represents values with a small, flexible number of bits and stores outliers in their original format in a separate space, preventing any overflow. We minimize metadata and the overhead of locating quantized values using a software-hardware interaction that transfers quantization parameters and data layout to hardware. As a result, our approach has three advantages over cache compression techniques: (i) less metadata, (ii) a higher compression ratio for floating-point values and cache blocks with multiple data types, and (iii) lower overhead for locating the compressed blocks. It delivers on average 1.40×/1.45×/1.56× speedup and 24/26/30% energy reduction compared to a baseline that uses full-length variables in a 4/8/16-core system. Our approach also provides a 1.23× speedup, in a 4-core system, compared to state-of-the-art cache compression techniques and adds only 0.25% area overhead to the baseline processor.
{"title":"Characterizing Deep Learning Training Workloads on Alibaba-PAI","authors":"Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, Yangqing Jia","doi":"10.1109/IISWC47752.2019.9042047","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042047","url":null,"abstract":"Modern deep learning models have been exploited in various domains, including computer vision (CV), natural language processing (NLP), search and recommendation. In practical AI clusters, workloads training these models are run using software frameworks such as TensorFlow, Caffe, PyTorch and CNTK. One critical issue for efficiently operating practical AI clouds, is to characterize the computing and data transfer demands of these workloads, and more importantly, the training performance given the underlying software framework and hardware configurations. In this paper, we characterize deep learning training workloads from Platform of Artificial Intelligence (PAI) in Alibaba. We establish an analytical framework to investigate detailed execution time breakdown of various workloads using different training architectures, to identify performance bottleneck. Results show that weight/gradient communication during training takes almost 62% of the total execution time among all our workloads on average. The computation part, involving both GPU computing and memory access, are not the biggest bottleneck based on collective behavior of the workloads. We further evaluate attainable performance of the workloads on various potential software/hardware mappings, and explore implications on software architecture selection and hardware configurations. We identify that 60% of PS/Worker workloads can be potentially sped up when ported to the AllReduce architecture exploiting the high-speed NVLink for GPU interconnect, and on average 1.7X speedup can be achieved when Ethernet bandwidth is upgraded from 25 Gbps to 100 Gbps.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"32 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122563437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Optimizing GPU Cache Policies for MI Workloads
Authors: Johnathan Alsop, Matthew D. Sinclair, Srikant Bharadwaj, A. Duțu, Anthony Gutierrez, Onur Kayiran, Michael LeBeane, Sooraj Puthoor, Xianwei Zhang, T. Yeh, Bradford M. Beckmann
DOI: 10.1109/IISWC47752.2019.9041977
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: In recent years, machine intelligence (MI) applications have emerged as a major driver for the computing industry. Optimizing these workloads is important but complicated. As memory demands grow and data movement overheads increasingly limit performance, determining the best GPU caching policy to use for a diverse range of MI workloads represents one important challenge. To study this, we evaluate 17 MI applications and characterize their behavior using a range of GPU caching strategies. In our evaluations, we find that the choice of caching policy in GPU caches involves multiple performance trade-offs and interactions, and there is no one-size-fits-all GPU caching policy for MI workloads. Based on detailed simulation results, we motivate and evaluate a set of cache optimizations that consistently match the performance of the best static GPU caching policies.
Title: HolDCSim: A Holistic Simulator for Data Centers
Authors: Fan Yao, Kathy Ngyugen, Sai Santosh Dayapule, Jingxin Wu, Bingqian Liu, S. Subramaniam, Guru Venkataramani
DOI: 10.1109/IISWC47752.2019.9042105
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: A comprehensive data center simulation infrastructure, which models major hardware and system components and offers interfaces to manage the interplay between computation and communication resources, is critical in advancing future research for more effective performance and energy optimization in these environments. In this paper, we present HolDCSim, a light-weight, holistic, extensible, event-driven data center simulation platform that effectively models both server and network architectures. HolDCSim can be used in a variety of data center system studies including job/task scheduling, resource provisioning, global and local server farm power management, and performance analysis. We demonstrate the design of our simulation infrastructure, and illustrate the usefulness of our framework with a case study that analyzes server-network performance and energy efficiency. We also perform validation studies for our simulator on a physical testbed with Intel Xeon-based processors and Cisco network switches.
Title: Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs
Authors: Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, José Cano, Elliot J. Crowley, Björn Franke, A. Storkey, Michael F. P. O'Boyle
DOI: 10.1109/IISWC47752.2019.9042000
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: Convolutional Neural Networks (CNN) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, often simply by porting large models designed for the server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs, which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer and, based on these, produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and the subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2× slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3× with cuDNN and above 10× with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.
Title: Workload-Aware DRAM Error Prediction using Machine Learning
Authors: L. Mukhanov, Konstantinos Tovletoglou, H. Vandierendonck, Dimitrios S. Nikolopoulos, G. Karakonstantis
DOI: 10.1109/IISWC47752.2019.9041963
Venue: 2019 IEEE International Symposium on Workload Characterization (IISWC)
Abstract: The aggressive scaling of technology may have helped meet the growing demand for higher memory capacity and density, but it has also made DRAM cells more prone to errors. Such a reality triggered a lot of interest in modeling DRAM behavior, either for predicting errors in advance or for adjusting DRAM circuit parameters to achieve a better tradeoff between energy efficiency and reliability. Existing modeling efforts have studied the impact of a few operating parameters and temperature on DRAM reliability using custom FPGA setups; however, they neglected the combined effect of workload-specific features that can be systematically investigated only on a real system. In this paper, we present the results of our study on workload-dependent DRAM error behavior within a real server, considering various operating parameters such as the refresh rate, voltage and temperature. We show that the rate of single- and multi-bit errors may vary across workloads by 8×, indicating that program-inherent features can affect DRAM reliability significantly. Based on this observation, we extract 249 features, such as the memory access rate, the rate of cache misses, the memory reuse time and data entropy, from various compute-intensive, caching and analytics benchmarks. We apply several supervised learning methods to construct the DRAM error behavior model for 72 server-grade DRAM chips using the memory operating parameters and extracted program-inherent features. Our results show that, with an appropriate choice of program features and supervised learning method, the rate of single- and multi-bit errors can be predicted for a specific DRAM module with an average error of less than 10.5%, as opposed to the 2.9× estimation error obtained with a conventional workload-unaware error model. Our model enables designers to predict DRAM errors in advance in less than a second and to study the impact of any workload and applied software optimizations on DRAM reliability.
{"title":"Branch Prediction Is Not A Solved Problem: Measurements, Opportunities, and Future Directions","authors":"Chit-Kwan Lin, Stephen J. Tarsa","doi":"10.1109/IISWC47752.2019.9042108","DOIUrl":"https://doi.org/10.1109/IISWC47752.2019.9042108","url":null,"abstract":"Modern branch predictors predict the vast majority of conditional branch instructions with near-perfect accuracy, allowing superscalar, out-of-order processors to maximize speculative efficiency and thus performance. However, this impressive overall effectiveness belies a substantial missed opportunity in single-threaded instructions per cycle (IPC). For example, we show that correcting the mispredictions made by the state-of-the-art TAGE-SC-L branch predictor on SPECint 2017 would improve IPC by margins similar to an advance in process technology node. In this work, we measure and characterize these mispredictions. We find that they categorically arise from either (1) a small number of systematically hard-to-predict (H2P) branches; or (2) rare branches with low dynamic execution counts. Using data from SPECint 2017 and additional large code footprint applications, we quantify the occurrence and IPC impact of these two categories. We then demonstrate that solely increasing the resources afforded to existing branch predictors does not address the root causes of most mispredictions. This leads us to reexamine basic assumptions in branch prediction and to propose new research directions that, for example, deploy machine learning to improve pattern matching for H2Ps, and use on-chip phase learning to track long-term statistics for rare branches.","PeriodicalId":121068,"journal":{"name":"2019 IEEE International Symposium on Workload Characterization (IISWC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115304070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}