2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)最新文献

筛选
英文 中文
On the Impact of Instruction Address Translation Overhead 论指令地址转换开销的影响
Yufeng Zhou, Xiaowan Dong, A. Cox, S. Dwarkadas
{"title":"On the Impact of Instruction Address Translation Overhead","authors":"Yufeng Zhou, Xiaowan Dong, A. Cox, S. Dwarkadas","doi":"10.1109/ISPASS.2019.00018","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00018","url":null,"abstract":"Even on modern processors with their ever larger instruction translation lookaside buffers (TLBs), we find that a variety of widely used applications, ranging from compilers to web user-interface frameworks, suffer from high instruction address translation overheads. In this paper, we explore the efficacy of different operating system-level approaches to automatically reducing this instruction address translation overhead. Specifically, we evaluate the use of automatic superpage promotion and page table sharing as well as a transparent padding mechanism that enables small code regions to be mapped using superpages. Overall, we find that the combined effects of these different approaches can reduce an application's total execution cycles by up to 18%. Surprisingly, we find that improving address translation performance in the first-level instruction TLB can significantly reduce the address translation overhead for data accesses. The overall reduction in execution cycles is more than double the instruction address translation overhead on stock FreeBSD, demonstrating that data address translation and access synergistically benefit from less contention in the caches and TLBs that might be shared across instruction and data.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134151868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
2019 IEEE International Symposium on Performance Analysis of Systems and Software 2019 IEEE系统与软件性能分析国际研讨会
L. Eeckhout, David Brooks
{"title":"2019 IEEE International Symposium on Performance Analysis of Systems and Software","authors":"L. Eeckhout, David Brooks","doi":"10.1109/ispass.2009.4919624","DOIUrl":"https://doi.org/10.1109/ispass.2009.4919624","url":null,"abstract":"","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116406658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators HeteroMap:异构多加速器上图形分析高效处理的运行时性能预测器
Masab Ahmad, H. Dogan, Christopher J. Michael, O. Khan
{"title":"HeteroMap: A Runtime Performance Predictor for Efficient Processing of Graph Analytics on Heterogeneous Multi-Accelerators","authors":"Masab Ahmad, H. Dogan, Christopher J. Michael, O. Khan","doi":"10.1109/ISPASS.2019.00039","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00039","url":null,"abstract":"With the ever-increasing amount of data and input variations, portable performance is becoming harder to exploit on today's architectures. Computational setups utilize single-chip processors, such as GPUs or large-scale multicores for graph analytics. Some algorithm-input combinations perform more efficiently when utilizing a GPU's higher concurrency and bandwidth, while others perform better with a multicore's stronger data caching capabilities. Architectural choices also occur within selected accelerators, where variables such as threading and thread placement need to be decided for optimal performance. This paper proposes a performance predictor paradigm for a heterogeneous parallel architecture where multiple disparate accelerators are integrated in an operational high performance computing setup. The predictor aims to improve graph processing efficiency by exploiting the underlying concurrency variations within and across the heterogeneous integrated accelerators using graph benchmark and input characteristics. The evaluation shows that intelligent and real-time selection of near-optimal concurrency choices provides performance benefits ranging from 5 % to 3.8 x, and an energy benefit averaging around 2.4 x over the traditional single-accelerator setup.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123440388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
A Detailed Model for Contemporary GPU Memory Systems 当代GPU内存系统的详细模型
Mahmoud Khairy, Akshay Jain, Tor M. Aamodt, Timothy G. Rogers
{"title":"A Detailed Model for Contemporary GPU Memory Systems","authors":"Mahmoud Khairy, Akshay Jain, Tor M. Aamodt, Timothy G. Rogers","doi":"10.1109/ISPASS.2019.00023","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00023","url":null,"abstract":"This paper explores the impact of simulator accuracy on architecture design decisions in the general-purpose graphics processing unit (GPGPU) space. We enhance the most popular publicly available GPU simulator, GPGPU-Sim, by performing a rigorous correlation of the simulator with a contemporary GPU. Our enhanced GPU model is able to describe the NVIDIA Volta architecture in sufficient detail to reduce error in memory system counters by as much as 66×. The reduced error in the memory system further reduces execution time error by 2.5×. To demonstrate the accuracy of our enhanced model against a real machine, we perform a counter-by-counter validation against an NVIDIA TITAN V Volta GPU, demonstrating the relative accuracy of the new simulator versus the previous model. We go on to demonstrate that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling. Our results demonstrate that it is important for the academic community to enhance the level of detail in architecture simulators as system complexity continues to grow.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122414315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
The POP Detector: A Lightweight Online Program Phase Detection Framework POP检测器:一个轻量级的在线程序相位检测框架
Karl Taht, James Greensky, R. Balasubramonian
{"title":"The POP Detector: A Lightweight Online Program Phase Detection Framework","authors":"Karl Taht, James Greensky, R. Balasubramonian","doi":"10.1109/ISPASS.2019.00013","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00013","url":null,"abstract":"Real-time phase detection enables dynamic adaptation of systems based on different program behavior. Many phase detection techniques have been proposed, with the most successful relating the phases back to application code. In the scope of online phase detection, techniques employ sampling to mitigate the overheads of the phase detection framework. When phase intervals are long enough, sampling approaches perform well. We reopen the question of phase interval length by performing in-depth analysis on the trade-offs between overhead and phase detector performance. We present a new metric which captures the statistical trade-off between phase interval length, phase stability, and the number of phases. We find that while shorter phases perform best in the context of online optimization, existing implementations suffer from performance degradation and overhead at shorter interval sizes. To address this gap, we present the Precise Online Phase (POP) detector. The POP detector utilizes performance counters to build signatures, which are virtually lossless at finer granularity. As a second order, the simplicity of the detector reduces the runtime overhead to just 1.35% and 0.09% at 10M and 100M instruction intervals, respectively.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125438537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator mRNA:实现重构神经加速器的高效映射空间探索
Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, T. Krishna
{"title":"mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator","authors":"Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, T. Krishna","doi":"10.1109/ISPASS.2019.00040","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00040","url":null,"abstract":"Deep learning accelerators have emerged to enable energy-efficient and high-throughput inference from edge devices such as self-driving cars and smartphones, to data centers for batch inference such as recommendation systems. However, the actual energy efficiency and throughput of a deep learning accelerator depends on the deep neural network (DNN) loop nest mapping on the processing element array of an accelerator. Moreover, the efficiency of a mapping dramatically changes by the target DNN layer dimensions and available hardware resources. Therefore, the optimal mapping search problem is a non-trivial high-dimensional optimization problem. Although several tools and frameworks exist for compiling to CPUs and GPUs, we lack similar tools for deep learning accelerators. To deal with the optimized mapping search problem in deep learning accelerators, we propose mRNA (mapper for reconfigurable neural accelerators), which automatically searches optimal mappings using heuristics based on domain knowledge about deep learning and an energy/runtime cost evaluation framework. mRNA targets MAERI, a recently proposed open-source deep learning accelerator that provides flexibility via reconfigurable interconnects, to run the unique mappings for each layer generated by mRNA. In realistic machine learning workloads from MLPerf, the optimal mappings identified by mRNA framework provides 15% to 26% lower runtime and 55% to 64% lower energy for convolutional layers and 24% to 67% lower runtime and maximum 67% lower energy for fully connected layers compared to simple reference mappings manually picked for each layer.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131042910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
Emulating and Evaluating Hybrid Memory for Managed Languages on NUMA Hardware 基于NUMA硬件的托管语言混合内存仿真与评价
Shoaib Akram, Jennifer B. Sartor, K. McKinley, L. Eeckhout
{"title":"Emulating and Evaluating Hybrid Memory for Managed Languages on NUMA Hardware","authors":"Shoaib Akram, Jennifer B. Sartor, K. McKinley, L. Eeckhout","doi":"10.1109/ISPASS.2019.00017","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00017","url":null,"abstract":"Non-volatile memory (NVM) has the potential to become a mainstream memory technology and challenge DRAM. Researchers evaluating the speed, endurance, and abstractions of hybrid memories with DRAM and NVM typically use simulation, making it easy to evaluate the impact of different hardware technologies and parameters. Simulation is, however, extremely slow, limiting the applications and datasets in the evaluation. Simulation also precludes critical workloads, especially those written in managed languages such as Java and C#. Good methodology embraces a variety of techniques for evaluating new ideas, expanding the experimental scope, and uncovering new insights. This paper introduces a platform to emulate hybrid memory for managed languages using commodity NUMA servers. Emulation complements simulation but offers richer software experimentation. We use a thread-local socket to emulate DRAM and a remote socket to emulate NVM. We use standard C library routines to allocate heap memory on the DRAM and NVM sockets for use with explicit memory management or garbage collection. We evaluate the emulator using various configurations of write-rationing garbage collectors that improve NVM lifetimes by limiting writes to NVM, using 15 applications and various datasets and workload configurations. We show emulation and simulation confirm each other's trends in terms of writes to NVM for different software configurations, increasing our confidence in predicting future system effects. Emulation brings novel insights, such as the non-linear effects of multi-programmed workloads on NVM writes, and that Java applications write significantly more than their C++ equivalents. We make our software infrastructure publicly available to advance the evaluation of novel memory management schemes on hybrid memories.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132449687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Distributed Software Defined Networking Controller Failure Mode and Availability Analysis 分布式软件定义网络控制器故障模式及可用性分析
P. Reeser, Guilhem Tesseyre, Marcus Callaway
{"title":"Distributed Software Defined Networking Controller Failure Mode and Availability Analysis","authors":"P. Reeser, Guilhem Tesseyre, Marcus Callaway","doi":"10.1109/ISPASS.2019.00035","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00035","url":null,"abstract":"Given the critical role Software Defined Networking controllers play in cloud computing and networking architectures, understanding their resiliency profile is crucial. Using OpenContrail as a reference architecture, we analyze the typical distributed controller failure modes and their effects on the control and data planes. We then develop hardware- and software-centric theoretical availability models for a variety of physical topologies and software modes of operation. These parametric models are used to predict availability and quantify sensitivity to underlying platform and process resiliency. The results suggest that the distributed control plane can achieve very high availability, while the host data plane may achieve much lower availability due to inherent single points of failure.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133484658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Demystifying Crypto-Mining: Analysis and Optimizations of Memory-Hard PoW Algorithms 揭秘加密挖掘:内存硬PoW算法的分析与优化
Runchao Han, N. Foutris, Christos Kotselidis
{"title":"Demystifying Crypto-Mining: Analysis and Optimizations of Memory-Hard PoW Algorithms","authors":"Runchao Han, N. Foutris, Christos Kotselidis","doi":"10.1109/ISPASS.2019.00011","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00011","url":null,"abstract":"Blockchain technology has become extremely popular, during the last decade, mainly due to the successful application in the cryptocurrency domain. Following the explosion of Bitcoin and other cryptocurrencies, blockchain solutions are being deployed in almost every aspect of transactional operations as a means to safely exchange digital assets between non-trusted parties. At the heart of every blockchain deployment is the consensus protocol, which maintains the consistency of the blockchain upon satisfying incoming transactions. Although many consensus protocols have been recently introduced, the most prevalent is Proof-of- Work, which scales the blockchain globally by converting the consensus problem to a competition based on cryptographic hash functions; a process called “mining”. The Proof-of- Work consensus protocol employs memory-hard algorithms in order to counteract ASIC or FPGA mining that may compromise the decentralization and democratization of the blockchain. Unfortunately, this leads to increased power consumption and scalability challenges since numerous processing units such as GPUs, FPGAs, and ASICs, are required to satisfy the ever-increasing operational requirements of blockchain deployments. In this paper, we perform an in-depth performance analysis and characterization of the most common memory-hard PoW algorithms running on NVIDIA GPUs. Motivated by our experimental findings, we apply a series of optimizations on Ethash algorithm, the consensus protocol of the Ethereum blockchain. The implemented optimizations accelerate performance by 14% and improve energy efficiency by 10% when executing on three NVIDIA GPUs. As a result, the optimized Ethash algorithm outperformed its fastest commercial implementation.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127936450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 18
Assessing the Effects of Low Voltage in Branch Prediction Units 评估低压对支路预测装置的影响
Athanasios Chatzidimitriou, G. Papadimitriou, D. Gizopoulos, Shrikanth Ganapathy, J. Kalamatianos
{"title":"Assessing the Effects of Low Voltage in Branch Prediction Units","authors":"Athanasios Chatzidimitriou, G. Papadimitriou, D. Gizopoulos, Shrikanth Ganapathy, J. Kalamatianos","doi":"10.1109/ISPASS.2019.00020","DOIUrl":"https://doi.org/10.1109/ISPASS.2019.00020","url":null,"abstract":"Branch prediction units are key performance components in modern microprocessors as they are widely used to address control hazards and minimize misprediction stalls. The continuous urge of high performance has led designers to integrate highly sophisticated predictors with complex prediction algorithms and large storage requirements. As a result, BPUs in modern microprocessors consume large amounts of power. But when a system is under a limited power budget, critical decisions are required in order to achieve an equilibrium point between the BPU and the rest of the microprocessor. In this work, we present a comprehensive analysis of the effects of low voltage configuration Branch Prediction Units (BPU). We propose a design with separate voltage domain for the BPU, which exploits the speculative nature of the BPU (which is self-correcting) that allows reduction of power without affecting functional correctness. Our study explores how several branch predictor implementations behave when aggressively undervolted, the performance impact of BTB as well as in which cases it is more efficient to reduce the BP and BTB size instead of undervolting. We also show that protection of BPU SRAM arrays has limited potential to further increase the energy savings, showcasing a realistic protection implementation. Our results show that BPU undervolting can result in power savings up to 69%, while the microprocessor energy savings can be up to 12%, before the penalty of the performance degradation overcomes the benefits of low voltage. Neither smaller predictor sizes nor protection mechanisms can further improve energy consumption.","PeriodicalId":137786,"journal":{"name":"2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123120161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信