2007 IEEE 13th International Symposium on High Performance Computer Architecture — Latest Publications

Improving Branch Prediction and Predicated Execution in Out-of-Order Processors
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346186
E. Quiñones, Joan-Manuel Parcerisa, Antonio González
Abstract: If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches, transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side-effect: the removal of branches eliminates useful correlation information needed by conventional branch predictors, so the remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information resides not only in branches but also in the compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages. First, since prediction is done not on a branch basis but on a predicate-define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the proposed mechanism permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, selective predicate prediction is a very effective technique for implementing if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor, with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
Citations: 17
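The transformation this abstract builds on, turning a control dependence into a data dependence, can be sketched with a toy example. This illustrates generic if-conversion only, not the paper's predicate-prediction hardware:

```python
# Toy illustration of if-conversion: a hard-to-predict branch
# (control dependence) becomes a predicated operation (data dependence).
# Illustrative sketch only; not the paper's mechanism.

def branchy(values, threshold):
    """Branch version: each iteration takes a data-dependent branch."""
    total = 0
    for v in values:
        if v > threshold:        # hard to predict when values are irregular
            total += v
    return total

def if_converted(values, threshold):
    """If-converted version: a compare defines a predicate p, and the
    add is guarded by p. No branch remains, only a data dependence."""
    total = 0
    for v in values:
        p = v > threshold        # compare instruction defines predicate p
        total += v * p           # predicated add: contributes 0 when p is False
    return total
```

Both versions compute the same result; the second removes the branch, which is exactly why a conventional branch predictor loses the correlation information that the compare instruction still carries.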
Exploiting Postdominance for Speculative Parallelization
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346207
Mayank Agarwal, Kshitiz Malik, Kevin M. Woley, S. S. Stone, M. Frank
Abstract: Task-selection policies are critical to the performance of any architecture that uses speculation to extract parallel tasks from a sequential thread. This paper demonstrates that the immediate postdominators of conditional branches provide a larger set of parallel tasks than existing task-selection heuristics, which are limited to programming-language constructs (such as loops or procedure calls). Our evaluation shows that postdominance-based task selection achieves, on average, more than double the speedup of the best individual heuristic, and 33% more speedup than the best combination of heuristics. The specific contributions of this paper include, first, a description of task selection based on immediate postdominance for a system that speculatively creates tasks. Second, our experimental evaluation demonstrates that existing task-selection heuristics based on loops, procedure calls, and if-else statements are all subsumed by compiler-generated immediate postdominators. Finally, by demonstrating that dynamic reconvergence prediction closely approximates immediate postdominator analysis, we show that the notion of immediate postdominators may also be useful in constructing dynamic task-selection mechanisms.
Citations: 23
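The immediate postdominator of a branch is its reconvergence point, the first node every path from the branch must pass through on the way to the exit. A minimal sketch of computing it by iterative set intersection on a small hypothetical CFG (production compilers typically use the faster Lengauer-Tarjan algorithm):

```python
# Iterative postdominator computation on a tiny CFG with a single
# exit node. The graph below is a made-up example, not from the paper.

def postdominators(succ, exit_node):
    """pdom[n] = set of nodes that postdominate n (including n itself)."""
    nodes = set(succ) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n plus everything common to all successors' postdominator sets
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def immediate_postdominator(pdom, n):
    """The closest strict postdominator: the one whose own postdominator
    set equals n's strict postdominator set."""
    strict = pdom[n] - {n}
    for c in strict:
        if pdom[c] == strict:
            return c
    return None

# Diamond CFG: branch at A reconverges at D, then exits at X.
cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['X'], 'X': []}
```

For this diamond, `immediate_postdominator(postdominators(cfg, 'X'), 'A')` is `'D'`: the task starting at D is control-independent of the branch at A, which is precisely the task-selection property the paper exploits.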
Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346202
Moinuddin K. Qureshi, M. A. Suleman, Y. Patt
Abstract: Caches are organized at a line-size granularity to exploit spatial locality. However, when spatial locality is low, many words in the cache line are not used. Unused words occupy cache space but do not contribute to cache hits. Filtering these words can allow the cache to store more cache lines. We show that unused words in a cache line are unlikely to be accessed in the less recent part of the LRU stack. We propose line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line. We also propose distill cache, a cache organization that utilizes the capacity created by LDIS. Our experiments with 16 memory-intensive benchmarks show that LDIS reduces the average misses for a 1MB 8-way L2 cache by 30% and improves the average IPC by 12%.
Citations: 84
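The core bookkeeping is per-word usage bits on each line. A minimal sketch of tracking touched words and "distilling" a line down to them; the 8-words-per-line size and the list representation are illustrative assumptions, not the paper's organization:

```python
# Sketch of per-word usage tracking for line distillation.
# Assumption: 8 words per line (e.g., 64B lines of 8B words).

WORDS_PER_LINE = 8

class CacheLine:
    def __init__(self, tag, data):
        assert len(data) == WORDS_PER_LINE
        self.tag, self.data = tag, data
        self.used = [False] * WORDS_PER_LINE  # per-word usage bits

    def access(self, word_index):
        """Read one word and mark it as used."""
        self.used[word_index] = True
        return self.data[word_index]

    def distill(self):
        """Keep only (index, word) pairs that were actually accessed.
        The freed space can then hold used words of other lines."""
        return [(i, w) for i, (w, u) in
                enumerate(zip(self.data, self.used)) if u]
```

If only two of eight words were touched before eviction, the distilled line occupies a quarter of the original space, which is the capacity the distill cache reclaims.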
A Domain-Specific On-Chip Network Design for Large Scale Cache Systems
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346209
Yuho Jin, Eun Jung Kim, K. H. Yum
Abstract: As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses or dedicated wires. However, using a general on-chip network for a specific domain may cause underutilization of the network resources and huge network delays, because the interconnects are not optimized for the domain. Addressing these two issues is challenging because in-depth knowledge of both interconnects and the specific domain is required. Non-uniform cache architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized and occupy considerable chip area (52% of cache area). Also, the network delay is significantly large (63% of cache access time). Motivated by these observations, we investigate how to optimize cache operations and design the network in large-scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally, we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network. Simulation results show that our networked cache system improves the average IPC by 38% over the mesh network design with multicast promotion replacement while using only 23% of the interconnection area. Specifically, multicast fast-LRU replacement improves the average IPC by 20% compared with multicast promotion replacement. A halo topology design additionally improves the average IPC by 18% over a mesh topology.
Citations: 40
Concurrent Direct Network Access for Virtual Machine Monitors
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346208
Jeffrey Shafer, D. Carr, Aravind Menon, S. Rixner, A. Cox, W. Zwaenepoel, Paul Willmann
Abstract: This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.
Citations: 165
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346204
Luke Yen, J. Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, M. Hill, M. Swift, D. Wood
Abstract: This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transaction's read- and write-sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly optimized cache arrays, because LogTM-SE never moves cached data, changes a block's cache state, or flash-clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging.
Citations: 365
Petascale Computing Research Challenges - A Manycore Perspective
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346188
S. Pawlowski
Abstract: Summary form only given. Future high performance computing will undoubtedly reach petascale and beyond, and today's HPC is tomorrow's personal computing. What are the evolving processor architectures toward multi-core and many-core for the best performance per watt, the memory bandwidth solutions to feed ever more powerful processors, and the intra-chip interconnect options for optimal bandwidth versus power? With Moore's Law continuing to prove its viability and transistor geometries shrinking, improving reliability is even more challenging. Intel Senior Fellow and Chief Technology Officer of Intel's Digital Enterprise Group, Steve Pawlowski, provides his technology vision, insight, and research challenges for achieving petascale computing and beyond.
Citations: 2
HARD: Hardware-Assisted Lockset-based Race Detection
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346191
Pin Zhou, R. Teodorescu, Yuanyuan Zhou
Abstract: The emergence of multicore architectures will lead to an increase in the use of multithreaded applications that are prone to synchronization bugs, such as data races. Software solutions for detecting data races generally incur large overheads. Hardware support for race detection can significantly reduce that overhead. However, all existing hardware proposals for race detection are based on the happens-before algorithm, which is sensitive to thread interleaving and cannot detect races that are not exposed during the monitored run. The lockset algorithm addresses this limitation. Unfortunately, due to challenging issues such as storing the lockset information and performing complex set operations, it has so far been implemented only in software, with a 10-30x performance hit. This paper proposes the first hardware implementation (called HARD) of the lockset algorithm, to exploit the race detection capability of this algorithm with minimal overhead. HARD efficiently stores lock sets in hardware Bloom filters and converts the expensive set operations into fast bitwise logic operations with negligible overhead. We evaluate HARD using six SPLASH-2 applications with 60 randomly injected bugs. Our results show that HARD can detect 54 of the 60 tested bugs, 20% more than happens-before, with only 0.1-2.6% execution overhead. We also show that our hardware design is cost-effective by comparing it with an ideal lockset implementation, which would require a large amount of hardware resources.
Citations: 163
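The key trick named in the abstract, lock sets stored as Bloom filters so that set intersection becomes a single bitwise AND, can be sketched as follows. The filter size and the two hash functions are illustrative assumptions, not HARD's actual parameters:

```python
# Lockset-as-Bloom-filter sketch. A lock set is a fixed-width bit
# vector; intersection (the lockset algorithm's core operation) is
# one AND. Hashes and width are made-up, not HARD's design.

FILTER_BITS = 64

def bloom_add(sig, lock_id):
    """Fold a lock id into a signature via two cheap hash functions."""
    h1 = lock_id % FILTER_BITS
    h2 = (lock_id * 31 + 7) % FILTER_BITS
    return sig | (1 << h1) | (1 << h2)

def refine(candidate_sig, held_sig):
    """Lockset refinement: intersect a shared variable's candidate
    lock set with the locks held by the accessing thread. An empty
    (zero) result flags a potential race; Bloom encoding can only
    add false positives, never miss a common lock."""
    return candidate_sig & held_sig
```

Here the 10-30x software cost of maintaining and intersecting explicit sets collapses into OR on acquire and AND on access, which is what makes a hardware implementation cheap.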
Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346193
A. Meixner, Daniel J. Sorin
Abstract: To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explored dynamic checking of cache coherence as a comprehensive approach to memory system error detection. However, existing coherence checkers are costly to implement, incur high interconnection network traffic overhead, and do not scale well. In this paper, we describe the token coherence signature checker (TCSC), which provides comprehensive, low-cost, scalable coherence checking by maintaining signatures that represent recent histories of coherence events at all nodes (cache and memory controllers). Periodically, these signatures are sent to a verifier to determine if an error occurred. TCSC has a small constant hardware cost per node, independent of cache and memory size and the number of nodes. TCSC's interconnect bandwidth overhead has a constant upper bound and never exceeds 7% in our experiments. TCSC has negligible impact on system performance.
Citations: 34
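The overall shape of such a checker, per-node signatures of coherence events combined at a verifier, can be sketched around token conservation (in token coherence, each block's tokens are fixed in number, so transfers must be zero-sum). The additive hash below is an illustrative assumption, not TCSC's actual signature encoding:

```python
# Sketch in the spirit of signature-based coherence checking: each
# node folds its (block address, token delta) events into a small
# additive signature; a verifier checks the combined signature is 0.
# Hash and modulus are illustrative, not TCSC's design.

MOD = 2**32

def fold(signature, block_addr, token_delta):
    """Accumulate one coherence event into a node's signature.
    Weighting by a per-address hash makes it unlikely that errors
    on different blocks cancel each other out."""
    weight = (block_addr * 2654435761) % MOD  # Knuth multiplicative hash
    return (signature + weight * token_delta) % MOD

def verify(node_signatures):
    """Token transfers are zero-sum across nodes, so a correct
    history of events combines to zero (with high probability)."""
    return sum(node_signatures) % MOD == 0
```

Only the fixed-width signatures travel to the verifier, which matches the abstract's point about constant per-node cost and bounded interconnect overhead.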
Colorama: Architectural Support for Data-Centric Synchronization
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346192
L. Ceze, Pablo Montesinos, C. V. Praun, J. Torrellas
Abstract: With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs nonlocal reasoning to decide where to place transactions in the code. A significant improvement to the art is data-centric synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations. This paper proposes novel architectural support to make DCS feasible, and describes its programming model and interface. The proposal, called Colorama, needs only modest hardware extensions, supports general-purpose, pointer-based languages such as C/C++, and, in our opinion, can substantially simplify the task of writing new parallel programs.
Citations: 44