Proceedings of the 40th Annual International Symposium on Computer Architecture最新文献_第3页

QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs QuickRec:一个用于记录和回放多线程程序的英特尔架构扩展原型

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485977

Gilles A. Pokam, Klaus Danne, C. Pereira, R. Kassa, T. Kranich, Shiliang Hu, Justin Emile Gottschlich, N. Honarmand, Nathan Dautenhahn, Samuel T. King, J. Torrellas

{"title":"QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs","authors":"Gilles A. Pokam, Klaus Danne, C. Pereira, R. Kassa, T. Kranich, Shiliang Hu, Justin Emile Gottschlich, N. Honarmand, Nathan Dautenhahn, Samuel T. King, J. Torrellas","doi":"10.1145/2485922.2485977","DOIUrl":"https://doi.org/10.1145/2485922.2485977","url":null,"abstract":"There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86914739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

SIMD divergence optimization through intra-warp compaction 通过曲内压缩优化SIMD散度

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485954

A. S. Vaidya, A. Shayesteh, Dong Hyuk Woo, Roy Saharoy, M. Azimi

{"title":"SIMD divergence optimization through intra-warp compaction","authors":"A. S. Vaidya, A. Shayesteh, Dong Hyuk Woo, Roy Saharoy, M. Azimi","doi":"10.1145/2485922.2485954","DOIUrl":"https://doi.org/10.1145/2485922.2485954","url":null,"abstract":"SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87117889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 31

ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates ArchShield:通过容忍高错误率来协助DRAM扩展的架构框架

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485929

Prashant J. Nair, Dae-Hyun Kim, Moinuddin K. Qureshi

{"title":"ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates","authors":"Prashant J. Nair, Dae-Hyun Kim, Moinuddin K. Qureshi","doi":"10.1145/2485922.2485929","DOIUrl":"https://doi.org/10.1145/2485922.2485929","url":null,"abstract":"DRAM scaling has been the prime driver for increasing the capacity of main memory system over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly high error-rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error-rates. To develop cost-effective solutions for tolerating high error-rates, this paper advocates a cross-layer approach. Rather than hiding the faulty cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components, a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both Fault Map and SWLR are integrated in reserved area in DRAM memory. Our evaluations with 8GB DRAM DIMM show that ArchShield can efficiently tolerate error-rates as higher as 10−4 (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"224 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89043343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 170

Proceedings of the 40th Annual International Symposium on Computer Architecture 第40届计算机体系结构国际研讨会论文集

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922

A. Mendelson

引用次数: 4

Criticality stacks: identifying critical threads in parallel programs using synchronization behavior 临界栈:在使用同步行为的并行程序中识别临界线程

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485966

Kristof Du Bois, Stijn Eyerman, Jennifer B. Sartor, L. Eeckhout

{"title":"Criticality stacks: identifying critical threads in parallel programs using synchronization behavior","authors":"Kristof Du Bois, Stijn Eyerman, Jennifer B. Sartor, L. Eeckhout","doi":"10.1145/2485922.2485966","DOIUrl":"https://doi.org/10.1145/2485922.2485966","url":null,"abstract":"Analyzing multi-threaded programs is quite challenging, but is necessary to obtain good multicore performance while saving energy. Due to synchronization, certain threads make others wait, because they hold a lock or have yet to reach a barrier. We call these critical threads, i.e., threads whose performance is determinative of program performance as a whole. Identifying these threads can reveal numerous optimization opportunities, for the software developer and for hardware. In this paper, we propose a new metric for assessing thread criticality, which combines both how much time a thread is performing useful work and how many co-running threads are waiting. We show how thread criticality can be calculated online with modest hardware additions and with low overhead. We use our metric to create criticality stacks that break total execution time into each thread's criticality component, allowing for easy visual analysis of parallel imbalance. To validate our criticality metric, and demonstrate it is better than previous metrics, we scale the frequency of the most critical thread and show it achieves the largest performance improvement. We then demonstrate the broad applicability of criticality stacks by using them to perform three types of optimizations: (1) program analysis to remove parallel bottlenecks, (2) dynamically identifying the most critical thread and accelerating it using frequency scaling to improve performance, and (3) showing that accelerating only the most critical thread allows for targeted energy reduction.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75677757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 79

Improving virtualization in the presence of software managed translation lookaside buffers 在存在软件管理的翻译暂存缓冲区的情况下改进虚拟化

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485933

Xiaotao Chang, H. Franke, Y. Ge, Tao Liu, Kun Wang, J. Xenidis, Fei Chen, Yu Zhang

{"title":"Improving virtualization in the presence of software managed translation lookaside buffers","authors":"Xiaotao Chang, H. Franke, Y. Ge, Tao Liu, Kun Wang, J. Xenidis, Fei Chen, Yu Zhang","doi":"10.1145/2485922.2485933","DOIUrl":"https://doi.org/10.1145/2485922.2485933","url":null,"abstract":"Virtualization has become an important technology that is used across many platforms, particularly servers, to increase utilization, multi-tenancy and security. Virtualization introduces additional overhead that often relates to memory management, interrupt handling and hypervisor mode switching. Among those, memory management and translation lookaside buffer (TLB) management have been shown to have a significant impact on the performance of systems. Two principal mechanisms for TLB management exist in today's systems, namely software and hardware managed TLBs. In this paper, we analyze and quantify the overhead of a pure software virtualization that is implemented over a software managed TLB. We then describe our design of hardware extensions to support virtualization in systems with software managed TLBs to remove the most dominant overheads. These extensions were implemented in the Power embedded A2 core, which is used in the PowerEN and in the Blue Gene/Q processors. They were used to implement a KVM port. We evaluate each of these hardware extensions to determine their overall contributions to performance and efficiency. Collectively these extensions demonstrate an average improvement of 232% over a pure software implementation.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"244 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84622717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 29

AC-DIMM: associative computing with STT-MRAM AC-DIMM:与STT-MRAM关联计算

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485939

Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, E. Friedman

{"title":"AC-DIMM: associative computing with STT-MRAM","authors":"Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, E. Friedman","doi":"10.1145/2485922.2485939","DOIUrl":"https://doi.org/10.1145/2485922.2485939","url":null,"abstract":"With technology scaling, on-chip power dissipation and off-chip memory bandwidth have become significant performance bottlenecks in virtually all computer systems, from mobile devices to supercomputers. An effective way of improving performance in the face of bandwidth and power limitations is to rely on associative memory systems. Recent work on a PCM-based, associative TCAM accelerator shows that associative search capability can reduce both off-chip bandwidth demand and overall system energy. Unfortunately, previously proposed resistive TCAM accelerators have limited flexibility: only a restricted (albeit important) class of applications can benefit from a TCAM accelerator, and the implementation is confined to resistive memory technologies with a high dynamic range (RHigh/RLow), such as PCM. This work proposes AC-DIMM, a flexible, high-performance associative compute engine built on a DDR3-compatible memory module. AC-DIMM addresses the limited flexibility of previous resistive TCAM accelerators by combining two powerful capabilities---associative search and processing in memory. Generality is improved by augmenting a TCAM system with a set of integrated, user programmable microcontrollers that operate directly on search results, and by architecting the system such that key-value pairs can be co-located in the same TCAM row. A new, bit-serial TCAM array is proposed, which enables the system to be implemented using STT-MRAM. AC-DIMM achieves a 4.2X speedup and a 6.5X energy reduction over a conventional RAM-based system on a set of 13 evaluated applications.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"1 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72618512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 132

Secure I/O device sharing among virtual machines on multiple hosts 多台主机上的虚拟机之间的安全I/O设备共享

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485932

Cheng-Chun Tu, Chao-Tang Lee, T. Chiueh

引用次数: 22

LINQits: big data on little clients linqit:小客户端的大数据

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485945

Eric S. Chung, John D. Davis, Jaewon Lee

引用次数: 102

Orchestrated scheduling and prefetching for GPGPUs 对gpgpu进行编排调度和预取

Proceedings of the 40th Annual International Symposium on Computer Architecture Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485951

Adwait Jog, Onur Kayiran, Asit K. Mishra, M. Kandemir, O. Mutlu, R. Iyer, C. Das

{"title":"Orchestrated scheduling and prefetching for GPGPUs","authors":"Adwait Jog, Onur Kayiran, Asit K. Mishra, M. Kandemir, O. Mutlu, R. Iyer, C. Das","doi":"10.1145/2485922.2485951","DOIUrl":"https://doi.org/10.1145/2485922.2485951","url":null,"abstract":"In this paper, we present techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies. We demonstrate that existing warp scheduling policies in GPGPU architectures are unable to effectively incorporate data prefetching. The main reason is that they schedule consecutive warps, which are likely to access nearby cache blocks and thus prefetch accurately for one another, back-to-back in consecutive cycles. This either 1) causes prefetches to be generated by a warp too close to the time their corresponding addresses are actually demanded by another warp, or 2) requires sophisticated prefetcher designs to correctly predict the addresses required by a future \"far-ahead\" warp while executing the current warp. We propose a new prefetch-aware warp scheduling policy that overcomes these problems. The key idea is to separate in time the scheduling of consecutive warps such that they are not executed back-to-back. We show that this policy not only enables a simple prefetcher to be effective in tolerating memory latencies but also improves memory bank parallelism, even when prefetching is not employed. Experimental evaluations across a diverse set of applications on a 30-core simulated GPGPU platform demonstrate that the prefetch-aware warp scheduler provides 25% and 7% average performance improvement over baselines that employ prefetching in conjunction with, respectively, the commonly-employed round-robin scheduler or the recently-proposed two-level warp scheduler. Moreover, when prefetching is not employed, the prefetch-aware warp scheduler provides higher performance than both of these baseline schedulers as it better exploits memory bank parallelism.","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79513713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 194