2019 IEEE International Symposium on Workload Characterization (IISWC): Latest Publications, Page 2

A Closer Look at Lightweight Graph Reordering

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9041948

P. Faldu, Jeff Diamond, Boris Grot

Abstract: Graph analytics power a range of applications in areas as diverse as finance, networking and business logistics. A common property of graphs used in the domain of graph analytics is a power-law distribution of vertex connectivity, wherein a small number of vertices are responsible for a high fraction of all connections in the graph. These richly-connected (hot) vertices inherently exhibit high reuse. However, their sparse distribution in memory leads to a severe underutilization of on-chip cache capacity. Prior works have proposed lightweight skew-aware vertex reordering that places hot vertices adjacent to each other in memory, reducing the cache footprint of hot vertices and thus improving cache efficiency. However, in doing so, they may inadvertently destroy the inherent community structure within the graph, which may negate the performance gains achieved from the reduced footprint of hot vertices. In this work, we study existing reordering techniques and demonstrate the inherent tension between reducing the cache footprint of hot vertices and preserving original graph structure. We quantify the potential performance loss due to disruption in graph structure for different graph datasets. We further show that reordering techniques that employ fine-grain reordering significantly increase misses in the higher level caches, even when they reduce misses in the last level cache. To overcome the limitations of existing reordering techniques, we propose Degree-Based Grouping (DBG), a novel lightweight reordering technique that employs a coarse-grain reordering to largely preserve graph structure while reducing the cache footprint of hot vertices. Our evaluation on 40 combinations of various graph applications and datasets shows that, compared to a baseline with no reordering, DBG yields an average application speed-up of 16.8%, versus 11.6% for the best-performing existing lightweight technique.

Citations: 33
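As an illustration of the coarse-grain idea behind Degree-Based Grouping described in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes vertices are binned into a few degree ranges and keep their original relative order inside each bin; the function name `degree_based_grouping` and the choice of thresholds (multiples of the average degree) are hypothetical.

```python
def degree_based_grouping(degrees, thresholds):
    """Return old vertex IDs listed in their new order.

    degrees[v] is the degree of vertex v. thresholds must be sorted in
    descending order; group k collects vertices with degree >= thresholds[k],
    and a final group collects everything else. These thresholds are an
    assumption made for illustration, not values taken from the paper.
    """
    groups = [[] for _ in range(len(thresholds) + 1)]
    for v, d in enumerate(degrees):
        for k, t in enumerate(thresholds):
            if d >= t:
                groups[k].append(v)  # original relative order is kept
                break
        else:
            groups[-1].append(v)  # coldest group
    return [v for group in groups for v in group]


# Hypothetical 8-vertex graph, described only by its vertex degrees.
degrees = [1, 8, 2, 9, 1, 4, 1, 2]
avg = sum(degrees) / len(degrees)
order = degree_based_grouping(degrees, [2 * avg, avg])
new_id = {old: new for new, old in enumerate(order)}  # old ID -> new ID
print(order)
```

In this toy input, vertices 1 and 3 land together at the front, yet vertex 1 still precedes vertex 3 even though vertex 3 has the higher degree; a full sort by degree would swap them, which is the kind of fine-grain disruption the abstract warns about.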

Performance-driven Programming of Multi-TFLOP Deep Learning Accelerators*

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042017

Swagath Venkataramani, Jungwook Choi, V. Srinivasan, K. Gopalakrishnan, Leland Chang

Abstract: Deep Neural Network (DNN) accelerator architectures have evolved rapidly in recent years, demonstrating impressive peak processing efficiencies. However, little effort has been devoted towards developing systematic methodologies to program DNN accelerators to extract the best accelerator utilization across a range of DNN workloads. This becomes critical as DNN layers vary dramatically in their computational characteristics, necessitating them to be programmed differently to maximize overall performance. In this work, we address this challenge in the context of the RaPiD multi-TFLOP DNN accelerator proposed in [1], which comprises a 2D-systolic array of processing elements, a 1D-array of special function units and a scratchpad memory. We develop DeepMatrix, a framework that enables systematic exploration of the design space to map DNNs to a given accelerator architecture, which can discover even non-intuitive optimization strategies to achieve high utilization. Specifically, given a DNN, it identifies how the computations need to be spatiotemporally sequenced, how much data needs to be staged at each level in the memory hierarchy and when data-transfers between memory hierarchies need to occur so that performance is maximized while meeting the constraints imposed by the hardware (processing power, memory capacity, bandwidth, etc.). DeepMatrix achieves this by building a parameterized design space of mapping configurations, and uses a design space exploration methodology to identify the best configuration. Across multiple large and practical DNNs (AlexNet, ResNet, VGG), we demonstrate that DeepMatrix can yield 1.4x−2.8x improvement in performance over hand-tuned homogeneous mapping.

Citations: 2

Fingerprinting Anomalous Computation with RNN for GPU-accelerated HPC Machines*

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042165

Pengfei Zou, Ang Li, K. Barker, Rong Ge

Citations: 5

Barrier Synchronization vs. Voltage Noise: A Quantitative Analysis*

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9041950

Z. Chowdhury, S. K. Khatamifard, Zhaoyong Zheng, T. Moreshet, R. I. Bahar, Ulya R. Karpuzcu

Citations: 2

Efficacy of Statistical Sampling on Contemporary Workloads: The Case of SPEC CPU2017

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042114

Sarabjeet Singh, M. Awasthi

Abstract: New benchmark suites are constantly being released, with each one providing a much larger set of benchmarks, representing an ever-growing variety of workloads. Contemporary workloads are increasingly complex in their computational and memory footprints. Most computer architecture research is based on the ability of researchers to simulate novel ideas with a variety of workloads representing the domain being researched. However, bigger and more complex benchmark suites have made it extremely impractical to simulate complete benchmarks from start to finish. As a result, architects are becoming increasingly dependent on statistical sampling techniques like SimPoints, which identify long, repetitive execution phases in benchmarks, and limit simulations to a few instances of these phases. These techniques present an inherent trade-off between simulation speed and accuracy. This work presents results and insights for determining the accuracy of simulation points for the SPEC CPU2017 suite, using Pin and PinPoints, which is an implementation of SimPoints for the x86 ISA. Our analysis concludes that carefully chosen simulation points faithfully represent the workload; we observe less than 1% variance in the instruction distribution between full runs and the ones using SimPoints, while reducing simulation time by ~750x. We also show that on average, just 12 phases can faithfully represent the 90th percentile of a benchmark's behavior, which can help reduce the overall simulation time by up to ~1297x. In addition, using performance statistics with native binaries on real hardware and from an architectural model of the same machine using SimPoints, we report good correlations between the two on metrics such as CPI. Finally, we present cases like memory hierarchy explorations, where SimPoints should be used judiciously and with extreme caution in order to derive correct conclusions: inappropriately chosen SimPoint configurations can show large deviations in memory hierarchy behavior as compared to full runs, as reported by prior studies.

Citations: 5
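To make the speed-versus-accuracy trade-off in the abstract above concrete, here is a minimal sketch of how results from sampled phases are commonly combined: each simulation point carries a weight (the fraction of execution its phase represents), and whole-program CPI is estimated as the weighted average. The function names and all sample numbers are hypothetical and are not results from the paper; warm-up and other practical details are ignored.

```python
def estimate_cpi(points):
    """Estimate whole-program CPI from (weight, cpi) pairs, one per simulation point."""
    total_weight = sum(w for w, _ in points)
    return sum(w * cpi for w, cpi in points) / total_weight


def simulation_reduction(total_instructions, num_points, interval_size):
    """Ratio of full-run instructions to the instructions actually simulated."""
    return total_instructions / (num_points * interval_size)


# Hypothetical phases: (weight, CPI measured on that simulation point).
points = [(0.5, 1.0), (0.3, 2.0), (0.2, 4.0)]
cpi = estimate_cpi(points)

# Hypothetical run: 10^12 instructions, 10 points of 10^8 instructions each.
reduction = simulation_reduction(10**12, 10, 10**8)
print(round(cpi, 3), reduction)
```

The weighted average is only as good as the chosen points: if a phase that matters for, say, cache behavior is missing or under-weighted, the estimate drifts, which is the caution the abstract raises for memory hierarchy studies.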

Detecting Last-Level Cache Contention in Workload Colocation with Meta Learning

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9041983

Huanxing Shen, Cong Li

Abstract: While workload colocation improves cluster utilization in cloud environments, it introduces performance-impacting contentions on unmanaged resources. We address the problem of detecting the contentions on last-level cache with low-level platform counters, but without application performance data. The detection is performed in a noisy environment with a mix of contention cases and non-contention cases, but without the ground truth. We propose a meta-learning approach to discriminate the increase of cache miss metrics taking the cache occupancy data as the precondition. We assume that given a certain workload intensity, when the cache occupancy of the workload drops below its hot data size, increasing cache misses will be observed. Leveraging the assumption, the threshold of cache miss metrics to detect cache interference under the workload intensity is found by inducing the most discriminating rule from the noisy history. Similarly, we determine whether the cache interference impacts performance by discriminating the increase of cycles per instruction metrics with the interference signal. Experimental results indicate that the new approach achieves decent performance in identifying cache contentions with performance impact in noisy environments.

Citations: 3
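One plausible reading of "inducing the most discriminating rule from the noisy history" in the abstract above is a single-threshold rule (a decision stump) on the cache miss metric, scored against whether cache occupancy fell below the hot data size. The sketch below follows that reading. It is an assumption made for illustration, and the paper's actual meta-learning procedure may differ; the function name `induce_threshold` and the sample history are hypothetical.

```python
def induce_threshold(samples):
    """Pick the miss-metric threshold that best matches the occupancy signal.

    samples: list of (miss_metric, occupancy_below_hot_size) pairs taken at
    one workload intensity. Returns (threshold, agreement), where agreement
    is the fraction of samples for which "miss_metric > threshold" equals
    the occupancy flag.
    """
    values = sorted({m for m, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_score = None, -1
    for t in candidates:
        score = sum((m > t) == low for m, low in samples)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score / len(samples)


# Hypothetical noisy history; (1.1, True) contradicts the general trend.
history = [(1.0, False), (1.2, False), (1.5, False),
           (3.0, True), (3.5, True), (1.1, True), (4.0, True)]
threshold, agreement = induce_threshold(history)
print(threshold, round(agreement, 3))
```

Because the rule is chosen by overall agreement, a single contradictory sample does not move the threshold, which is the property needed when no ground truth is available.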

SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures*

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042069

D. Shankar, Xiaoyi Lu, D. Panda

Abstract: With the emergence of modern multi-core CPU architectures that support data parallelism via vectorization, several storage systems have been employing SIMD-based techniques to optimize data-parallel operations on in-memory structures like hash-tables. In this paper, we perform an in-depth characterization of the opportunities for incorporating AVX vectorization-based SIMD-aware designs for hash table lookups on emerging CPU architectures. We analyze the challenges and design dimensions involved in exploiting vectorization-based parallel key searching over cache-optimized non-SIMD hash tables. Based on this, we design a comprehensive micro-benchmark suite, SimdHT-Bench, that enables evaluating the performance and applicability of CPU SIMD-aware hash table designs for accelerating different read-intensive workloads. With SimdHT-Bench, we study five different use-case scenarios with varied workload patterns, on the latest Intel Skylake and Intel Cascade Lake multi-core CPU nodes. Further, to validate the applicability of SimdHT-Bench, we employ these performance studies to design a high-performance SIMD-aware RDMA-based in-memory key-value store to accelerate the Memcached ‘Multi-Get’ workload. We demonstrate that the SIMD-integrated designs can achieve up to 1.45x-2.04x improvement in server-side Get throughput and up to 34% improvement in end-to-end Multi-Get latencies over the state-of-the-art CPU-optimized non-SIMD MemC3 hash table design, on a high-performance compute cluster with Intel Skylake processors and InfiniBand EDR interconnects.

Citations: 6

Autonomous Data-Race-Free GPU Testing

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042019

T. Ta, Xianwei Zhang, Anthony Gutierrez, Bradford M. Beckmann

Abstract: As the deep learning and high-performance computing markets continue to grow, hardware designers are increasingly optimizing future GPUs to run compute (a.k.a. GPGPU) workloads. A key area of optimization for these compute-oriented designs, which was not emphasized when GPUs exclusively executed graphics workloads, is inter-thread data sharing and synchronization. GPU cache coherence protocols now support these operations and are governed by a specified memory consistency model. In general, current GPU models are based on sequential consistency for data-race-free (SC for DRF), which mandates data written to memory must be globally visible only after certain synchronization points. GPU coherence protocols based on such relaxed memory models are particularly difficult to design and test due to the large number of memory accesses that may be reordered. This leaves GPU hardware designers struggling to validate the correctness of GPU cache coherence optimizations. To address this issue, this paper introduces a novel, completely autonomous random testing methodology for complex GPU cache coherence protocols. Our framework continuously generates sequences of memory requests with minimal user intervention using a mix of load, store, and atomic operations. The tester dynamically and autonomously checks each response against an expected global view of memory and immediately detects any inconsistencies in a target coherence protocol, providing designers detailed feedback on the issue. We then demonstrate the methodology on the popular cycle-level gem5 simulator by replacing its GPU core model with our unique testing framework. The results show that the GPU tester can cover 94% and 100% of all reachable state transitions in L1 and L2 caches, respectively, of a representative GPU coherence protocol. This coverage is 6.25% and 25% higher than the one achieved by a wide selection of 26 applications. In addition, the tester runs more than 50 times faster than those applications, which enables efficient and fast protocol debugging.

Citations: 5

Multi-Bit Upsets Vulnerability Analysis of Modern Microprocessors

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9042036

Athanasios Chatzidimitriou, G. Papadimitriou, Christos Gavanas, George Katsoridas, D. Gizopoulos

Abstract: Miniaturization of integrated circuits brings more devices (thus more functionality) on the same silicon area but also makes them more vulnerable to soft (transient) errors. Assessment and understanding of the magnitude of a microprocessor's vulnerability to soft errors in early stages of the design can steer wise, cost-effective protection decisions at the hardware or software level. In recent fabrication technologies, the effect of radiation (neutrons or other particles) is significantly more severe on silicon devices and leads to increased numbers of multi-bit upsets. In this paper, we analyze the effects of multi-bit upsets in modern microprocessors, using microarchitecture level fault injection and a complete system stack. We present details about the effects of multi-bit upsets on 6 major hardware components of an ARM Cortex-A9 CPU modeled on the Gem5 microarchitectural simulator, with 15 workloads across 8 fabrication technology nodes. For the purposes of our analysis, we employ and extend the GeFIN (Gem5-based Fault INjector) framework to model and analyze multi-bit faults in the hardware structures of the CPU. The enhanced version of the fault injector models multi-bit faults in adjacent areas of a structure; a very realistic case when modern silicon chips are affected by radiation. Our analysis shows that the architectural vulnerability factor (AVF) significantly increases from 1.5x (+50%) to 3.2x (+220%) between single and triple-bit faults across components. We present the aggregate multi-bit AVF of each hardware structure and each technology node from 250nm to 22nm; our results show a significant AVF difference between single bit and aggregate multi-bit measurements, up to 35% as the technology node decreases; this reveals the magnitude of the assessment gap when only single bit errors are considered by any method. We report soft error Failures in Time (FIT) rates for the entire ARM Cortex-A9 CPU across technology nodes and our results show that the contribution of multi-bit upsets in the overall CPU FIT consistently increases across technologies and reaches 21% in 22nm.

Citations: 13
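The relationship between the per-cardinality AVF values and the FIT rates mentioned in the abstract above can be sketched with the commonly used formulation FIT = raw FIT per bit × number of bits × AVF. The aggregation below, which weights each k-bit AVF by the probability of a k-bit upset, is an assumption about how an "aggregate multi-bit AVF" could be formed and may differ from the paper's method. All names and numbers are hypothetical and are not measurements from the paper.

```python
def aggregate_avf(avf_by_bits, prob_by_bits):
    """Weight the AVF measured for k-bit faults by the probability of a k-bit upset."""
    return sum(prob_by_bits[k] * avf_by_bits[k] for k in avf_by_bits)


def structure_fit(raw_fit_per_bit, num_bits, avf):
    """FIT of one hardware structure: raw rate per bit x size in bits x AVF."""
    return raw_fit_per_bit * num_bits * avf


# Hypothetical AVF per fault cardinality (1-, 2- and 3-bit faults).
avf_by_bits = {1: 0.10, 2: 0.18, 3: 0.25}
# Hypothetical upset-cardinality probabilities for one technology node.
prob_by_bits = {1: 0.80, 2: 0.15, 3: 0.05}

agg = aggregate_avf(avf_by_bits, prob_by_bits)
single_only_fit = structure_fit(1e-4, 32 * 1024 * 8, avf_by_bits[1])
multi_bit_fit = structure_fit(1e-4, 32 * 1024 * 8, agg)
print(round(agg, 4), round(single_only_fit, 3), round(multi_bit_fit, 3))
```

With these made-up inputs the aggregate AVF exceeds the single-bit AVF, so a single-bit-only assessment underestimates the structure's FIT; the size of that gap in practice depends on the measured values.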

Deep Learning Language Modeling Workloads: Where Time Goes on Graphics Processors

2019 IEEE International Symposium on Workload Characterization (IISWC) Pub Date : 2019-11-01 DOI: 10.1109/IISWC47752.2019.9041972

Ali Hadi Zadeh, Zissis Poulos, Andreas Moshovos

Citations: 8