{"title":"On the Impact of Instruction Address Translation Overhead","authors":"Yufeng Zhou, Xiaowan Dong, A. Cox, S. Dwarkadas","doi":"10.1109/ISPASS.2019.00018","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00018","url":null,"abstract":"Even on modern processors with their ever larger instruction translation lookaside buffers (TLBs), we find that a variety of widely used applications, ranging from compilers to web user-interface frameworks, suffer from high instruction address translation overheads. In this paper, we explore the efficacy of different operating system-level approaches to automatically reducing this instruction address translation overhead. Specifically, we evaluate the use of automatic superpage promotion and page table sharing as well as a transparent padding mechanism that enables small code regions to be mapped using superpages. Overall, we find that the combined effects of these different approaches can reduce an application's total execution cycles by up to 18%. Surprisingly, we find that improving address translation performance in the first-level instruction TLB can significantly reduce the address translation overhead for data accesses. The overall reduction in execution cycles is more than double the instruction address translation overhead on stock FreeBSD, demonstrating that data address translation and access synergistically benefit from less contention in the caches and TLBs that might be shared across instruction and data.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134151868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"2019 IEEE International Symposium on Performance Analysis of Systems and Software","authors":"L. Eeckhout, David Brooks","doi":"10.1109/ispass.2009.4919624","DOIUrl":"https://doi.org/10.1109/ispass.2009.4919624","url":null,"abstract":"","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116406658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators","authors":"Masab Ahmad, H. Dogan, Christopher J. Michael, O. Khan","doi":"10.1109/ISPASS.2019.00039","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00039","url":null,"abstract":"With the ever-increasing amount of data and input variations, portable performance is becoming harder to exploit on today's architectures. Computational setups utilize single-chip processors, such as GPUs or large-scale multicores, for graph analytics. Some algorithm-input combinations perform more efficiently when utilizing a GPU's higher concurrency and bandwidth, while others perform better with a multicore's stronger data caching capabilities. Architectural choices also occur within selected accelerators, where variables such as threading and thread placement need to be decided for optimal performance. This paper proposes a performance predictor paradigm for a heterogeneous parallel architecture where multiple disparate accelerators are integrated in an operational high performance computing setup. The predictor aims to improve graph processing efficiency by exploiting the underlying concurrency variations within and across the heterogeneous integrated accelerators using graph benchmark and input characteristics. The evaluation shows that intelligent and real-time selection of near-optimal concurrency choices provides performance benefits ranging from 5% to 3.8x, and an energy benefit averaging around 2.4x over the traditional single-accelerator setup.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123440388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Detailed Model for Contemporary GPU Memory Systems","authors":"Mahmoud Khairy, Akshay Jain, Tor M. Aamodt, Timothy G. Rogers","doi":"10.1109/ISPASS.2019.00023","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00023","url":null,"abstract":"This paper explores the impact of simulator accuracy on architecture design decisions in the general-purpose graphics processing unit (GPGPU) space. We enhance the most popular publicly available GPU simulator, GPGPU-Sim, by performing a rigorous correlation of the simulator with a contemporary GPU. Our enhanced GPU model is able to describe the NVIDIA Volta architecture in sufficient detail to reduce error in memory system counters by as much as 66×. The reduced error in the memory system further reduces execution time error by 2.5×. To demonstrate the accuracy of our enhanced model against a real machine, we perform a counter-by-counter validation against an NVIDIA TITAN V Volta GPU, demonstrating the relative accuracy of the new simulator versus the previous model. We go on to demonstrate that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling. Our results demonstrate that it is important for the academic community to enhance the level of detail in architecture simulators as system complexity continues to grow.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122414315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The POP Detector: A Lightweight Online Program Phase Detection Framework","authors":"Karl Taht, James Greensky, R. Balasubramonian","doi":"10.1109/ISPASS.2019.00013","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00013","url":null,"abstract":"Real-time phase detection enables dynamic adaptation of systems based on different program behavior. Many phase detection techniques have been proposed, with the most successful relating the phases back to application code. In the scope of online phase detection, techniques employ sampling to mitigate the overheads of the phase detection framework. When phase intervals are long enough, sampling approaches perform well. We reopen the question of phase interval length by performing in-depth analysis on the trade-offs between overhead and phase detector performance. We present a new metric which captures the statistical trade-off between phase interval length, phase stability, and the number of phases. We find that while shorter phases perform best in the context of online optimization, existing implementations suffer from performance degradation and overhead at shorter interval sizes. To address this gap, we present the Precise Online Phase (POP) detector. The POP detector utilizes performance counters to build signatures, which are virtually lossless at finer granularity. Moreover, the simplicity of the detector reduces the runtime overhead to just 1.35% and 0.09% at 10M- and 100M-instruction intervals, respectively.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125438537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"mRNA: Enabling Efficient Mapping Space Exploration for a Reconfigurable Neural Accelerator","authors":"Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, T. Krishna","doi":"10.1109/ISPASS.2019.00040","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00040","url":null,"abstract":"Deep learning accelerators have emerged to enable energy-efficient and high-throughput inference from edge devices such as self-driving cars and smartphones, to data centers for batch inference such as recommendation systems. However, the actual energy efficiency and throughput of a deep learning accelerator depend on the deep neural network (DNN) loop nest mapping on the processing element array of an accelerator. Moreover, the efficiency of a mapping changes dramatically with the target DNN layer dimensions and available hardware resources. Therefore, the optimal mapping search problem is a non-trivial high-dimensional optimization problem. Although several tools and frameworks exist for compiling to CPUs and GPUs, we lack similar tools for deep learning accelerators. To deal with the optimal mapping search problem in deep learning accelerators, we propose mRNA (mapper for reconfigurable neural accelerators), which automatically searches for optimal mappings using heuristics based on domain knowledge about deep learning and an energy/runtime cost evaluation framework. mRNA targets MAERI, a recently proposed open-source deep learning accelerator that provides flexibility via reconfigurable interconnects, to run the unique mappings for each layer generated by mRNA. On realistic machine learning workloads from MLPerf, the optimal mappings identified by the mRNA framework provide 15% to 26% lower runtime and 55% to 64% lower energy for convolutional layers, and 24% to 67% lower runtime and up to 67% lower energy for fully connected layers, compared to simple reference mappings manually picked for each layer.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131042910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emulating and Evaluating Hybrid Memory for Managed Languages on NUMA Hardware","authors":"Shoaib Akram, Jennifer B. Sartor, K. McKinley, L. Eeckhout","doi":"10.1109/ISPASS.2019.00017","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00017","url":null,"abstract":"Non-volatile memory (NVM) has the potential to become a mainstream memory technology and challenge DRAM. Researchers evaluating the speed, endurance, and abstractions of hybrid memories with DRAM and NVM typically use simulation, making it easy to evaluate the impact of different hardware technologies and parameters. Simulation is, however, extremely slow, limiting the applications and datasets in the evaluation. Simulation also precludes critical workloads, especially those written in managed languages such as Java and C#. Good methodology embraces a variety of techniques for evaluating new ideas, expanding the experimental scope, and uncovering new insights. This paper introduces a platform to emulate hybrid memory for managed languages using commodity NUMA servers. Emulation complements simulation but offers richer software experimentation. We use a thread-local socket to emulate DRAM and a remote socket to emulate NVM. We use standard C library routines to allocate heap memory on the DRAM and NVM sockets for use with explicit memory management or garbage collection. We evaluate the emulator using various configurations of write-rationing garbage collectors that improve NVM lifetimes by limiting writes to NVM, using 15 applications and various datasets and workload configurations. We show emulation and simulation confirm each other's trends in terms of writes to NVM for different software configurations, increasing our confidence in predicting future system effects. Emulation brings novel insights, such as the non-linear effects of multi-programmed workloads on NVM writes, and that Java applications write significantly more than their C++ equivalents. We make our software infrastructure publicly available to advance the evaluation of novel memory management schemes on hybrid memories.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132449687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Software Defined Networking Controller Failure Mode and Availability Analysis","authors":"P. Reeser, Guilhem Tesseyre, Marcus Callaway","doi":"10.1109/ISPASS.2019.00035","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00035","url":null,"abstract":"Given the critical role Software Defined Networking controllers play in cloud computing and networking architectures, understanding their resiliency profile is crucial. Using OpenContrail as a reference architecture, we analyze the typical distributed controller failure modes and their effects on the control and data planes. We then develop hardware- and software-centric theoretical availability models for a variety of physical topologies and software modes of operation. These parametric models are used to predict availability and quantify sensitivity to underlying platform and process resiliency. The results suggest that the distributed control plane can achieve very high availability, while the host data plane may achieve much lower availability due to inherent single points of failure.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133484658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Demystifying Crypto-Mining: Analysis and Optimizations of Memory-Hard PoW Algorithms","authors":"Runchao Han, N. Foutris, Christos Kotselidis","doi":"10.1109/ISPASS.2019.00011","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00011","url":null,"abstract":"Blockchain technology has become extremely popular during the last decade, mainly due to its successful application in the cryptocurrency domain. Following the explosion of Bitcoin and other cryptocurrencies, blockchain solutions are being deployed in almost every aspect of transactional operations as a means to safely exchange digital assets between non-trusted parties. At the heart of every blockchain deployment is the consensus protocol, which maintains the consistency of the blockchain upon satisfying incoming transactions. Although many consensus protocols have been recently introduced, the most prevalent is Proof-of-Work, which scales the blockchain globally by converting the consensus problem to a competition based on cryptographic hash functions; a process called \"mining\". The Proof-of-Work consensus protocol employs memory-hard algorithms in order to counteract ASIC or FPGA mining that may compromise the decentralization and democratization of the blockchain. Unfortunately, this leads to increased power consumption and scalability challenges, since numerous processing units such as GPUs, FPGAs, and ASICs are required to satisfy the ever-increasing operational requirements of blockchain deployments. In this paper, we perform an in-depth performance analysis and characterization of the most common memory-hard PoW algorithms running on NVIDIA GPUs. Motivated by our experimental findings, we apply a series of optimizations to Ethash, the consensus algorithm of the Ethereum blockchain. The implemented optimizations accelerate performance by 14% and improve energy efficiency by 10% when executing on three NVIDIA GPUs. As a result, the optimized Ethash algorithm outperforms its fastest commercial implementation.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127936450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the Effects of Low Voltage in Branch Prediction Units","authors":"Athanasios Chatzidimitriou, G. Papadimitriou, D. Gizopoulos, Shrikanth Ganapathy, J. Kalamatianos","doi":"10.1109/ISPASS.2019.00020","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00020","url":null,"abstract":"Branch prediction units (BPUs) are key performance components in modern microprocessors, as they are widely used to address control hazards and minimize misprediction stalls. The continuous urge for high performance has led designers to integrate highly sophisticated predictors with complex prediction algorithms and large storage requirements. As a result, BPUs in modern microprocessors consume large amounts of power. But when a system is under a limited power budget, critical decisions are required in order to achieve an equilibrium point between the BPU and the rest of the microprocessor. In this work, we present a comprehensive analysis of the effects of low-voltage operation on BPUs. We propose a design with a separate voltage domain for the BPU, which exploits the speculative, self-correcting nature of the BPU to reduce power without affecting functional correctness. Our study explores how several branch predictor implementations behave when aggressively undervolted, the performance impact of the branch target buffer (BTB), as well as the cases in which it is more efficient to reduce the branch predictor and BTB size instead of undervolting. We also show that protection of BPU SRAM arrays has limited potential to further increase the energy savings, showcasing a realistic protection implementation. Our results show that BPU undervolting can yield power savings of up to 69%, while the microprocessor energy savings can be up to 12%, before the penalty of the performance degradation overcomes the benefits of low voltage. Neither smaller predictor sizes nor protection mechanisms can further improve energy consumption.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123120161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}