2007 IEEE 13th International Symposium on High Performance Computer Architecture — Latest Publications

Improving Branch Prediction and Predicated Execution in Out-of-Order Processors
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346186
E. Quiñones, Joan-Manuel Parcerisa, Antonio González
Abstract: If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches, transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side-effect: the removal of branches eliminates useful correlation information needed by conventional branch predictors, so the remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information resides not only in branches but also in the compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages. First, since prediction is done not on a branch basis but on a predicate-define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the proposed mechanism permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, selective predicate prediction is a very effective technique for implementing if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor, with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
Citations: 17
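The transformation this abstract builds on, turning a control dependence into a data dependence, can be sketched with a toy example. This illustrates generic if-conversion only, not the paper's predicate-prediction hardware:

```python
# Toy illustration of if-conversion: a hard-to-predict branch
# (control dependence) becomes a predicated operation (data dependence).
# Illustrative sketch only; not the paper's mechanism.

def branchy(values, threshold):
    """Branch version: each iteration takes a data-dependent branch."""
    total = 0
    for v in values:
        if v > threshold:        # hard to predict when values are irregular
            total += v
    return total

def if_converted(values, threshold):
    """If-converted version: a compare defines a predicate p, and the
    add is guarded by p. No branch remains, only a data dependence."""
    total = 0
    for v in values:
        p = v > threshold        # compare instruction defines predicate p
        total += v * p           # predicated add: contributes 0 when p is False
    return total
```

Both versions compute the same result; the second removes the branch, which is exactly why a conventional branch predictor loses the correlation information that the compare instruction still carries.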
Exploiting Postdominance for Speculative Parallelization
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346207
Mayank Agarwal, Kshitiz Malik, Kevin M. Woley, S. S. Stone, M. Frank
Abstract: Task-selection policies are critical to the performance of any architecture that uses speculation to extract parallel tasks from a sequential thread. This paper demonstrates that the immediate postdominators of conditional branches provide a larger set of parallel tasks than existing task-selection heuristics, which are limited to programming-language constructs (such as loops or procedure calls). Our evaluation shows that postdominance-based task selection achieves, on average, more than double the speedup of the best individual heuristic, and 33% more speedup than the best combination of heuristics. The specific contributions of this paper include, first, a description of task selection based on immediate postdominance for a system that speculatively creates tasks. Second, our experimental evaluation demonstrates that existing task-selection heuristics based on loops, procedure calls, and if-else statements are all subsumed by compiler-generated immediate postdominators. Finally, by demonstrating that dynamic reconvergence prediction closely approximates immediate postdominator analysis, we show that the notion of immediate postdominators may also be useful in constructing dynamic task-selection mechanisms.
Citations: 23
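The immediate postdominator of a branch is its reconvergence point, the first node every path from the branch must pass through on the way to the exit. A minimal sketch of computing it by iterative set intersection on a small hypothetical CFG (production compilers typically use the faster Lengauer-Tarjan algorithm):

```python
# Iterative postdominator computation on a tiny CFG with a single
# exit node. The graph below is a made-up example, not from the paper.

def postdominators(succ, exit_node):
    """pdom[n] = set of nodes that postdominate n (including n itself)."""
    nodes = set(succ) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n plus everything common to all successors' postdominator sets
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def immediate_postdominator(pdom, n):
    """The closest strict postdominator: the one whose own postdominator
    set equals n's strict postdominator set."""
    strict = pdom[n] - {n}
    for c in strict:
        if pdom[c] == strict:
            return c
    return None

# Diamond CFG: branch at A reconverges at D, then exits at X.
cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['X'], 'X': []}
```

For this diamond, `immediate_postdominator(postdominators(cfg, 'X'), 'A')` is `'D'`: the task starting at D is control-independent of the branch at A, which is precisely the task-selection property the paper exploits.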
Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346202
Moinuddin K. Qureshi, M. A. Suleman, Y. Patt
Abstract: Caches are organized at a line-size granularity to exploit spatial locality. However, when spatial locality is low, many words in the cache line are not used. Unused words occupy cache space but do not contribute to cache hits. Filtering these words can allow the cache to store more cache lines. We show that unused words in a cache line are unlikely to be accessed in the less recent part of the LRU stack. We propose line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line. We also propose distill cache, a cache organization that utilizes the capacity created by LDIS. Our experiments with 16 memory-intensive benchmarks show that LDIS reduces the average misses for a 1MB 8-way L2 cache by 30% and improves the average IPC by 12%.
Citations: 84
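The core bookkeeping is per-word usage bits on each line. A minimal sketch of tracking touched words and "distilling" a line down to them; the 8-words-per-line size and the list representation are illustrative assumptions, not the paper's organization:

```python
# Sketch of per-word usage tracking for line distillation.
# Assumption: 8 words per line (e.g., 64B lines of 8B words).

WORDS_PER_LINE = 8

class CacheLine:
    def __init__(self, tag, data):
        assert len(data) == WORDS_PER_LINE
        self.tag, self.data = tag, data
        self.used = [False] * WORDS_PER_LINE  # per-word usage bits

    def access(self, word_index):
        """Read one word and mark it as used."""
        self.used[word_index] = True
        return self.data[word_index]

    def distill(self):
        """Keep only (index, word) pairs that were actually accessed.
        The freed space can then hold used words of other lines."""
        return [(i, w) for i, (w, u) in
                enumerate(zip(self.data, self.used)) if u]
```

If only two of eight words were touched before eviction, the distilled line occupies a quarter of the original space, which is the capacity the distill cache reclaims.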
A Domain-Specific On-Chip Network Design for Large Scale Cache Systems
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346209
Yuho Jin, Eun Jung Kim, K. H. Yum
Abstract: As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses or dedicated wires. However, using a general on-chip network for a specific domain may cause underutilization of the network resources and huge network delays, because the interconnects are not optimized for the domain. Addressing these two issues is challenging because in-depth knowledge of both interconnects and the specific domain is required. Non-uniform cache architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized and occupy considerable chip area (52% of cache area). Also, the network delay is significantly large (63% of cache access time). Motivated by these observations, we investigate how to optimize cache operations and design the network in large-scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally, we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network. Simulation results show that our networked cache system improves the average IPC by 38% over the mesh network design with multicast promotion replacement while using only 23% of the interconnection area. Specifically, multicast fast-LRU replacement improves the average IPC by 20% compared with multicast promotion replacement. A halo topology design additionally improves the average IPC by 18% over a mesh topology.
Citations: 40
Concurrent Direct Network Access for Virtual Machine Monitors
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346208
Jeffrey Shafer, D. Carr, Aravind Menon, S. Rixner, A. Cox, W. Zwaenepoel, Paul Willmann
Abstract: This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.
Citations: 165
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346204
Luke Yen, J. Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, M. Hill, M. Swift, D. Wood
Abstract: This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transaction's read- and write-sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly optimized cache arrays, because LogTM-SE never moves cached data, changes a block's cache state, or flash-clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging.
Citations: 365
Petascale Computing Research Challenges - A Manycore Perspective
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346188
S. Pawlowski
Abstract: Summary form only given. Future high performance computing will undoubtedly reach petascale and beyond, and today's HPC is tomorrow's personal computing. What are the evolving processor architectures toward multi-core and many-core for the best performance per watt, the memory bandwidth solutions to feed ever more powerful processors, and the intra-chip interconnect options for optimal bandwidth versus power? With Moore's Law continuing to prove its viability and transistor geometries shrinking, improving reliability is even more challenging. Intel Senior Fellow and Chief Technology Officer of Intel's Digital Enterprise Group, Steve Pawlowski, provides his technology vision, insight, and research challenges for achieving petascale computing and beyond.
Citations: 2
HARD: Hardware-Assisted Lockset-based Race Detection
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346191
Pin Zhou, R. Teodorescu, Yuanyuan Zhou
Abstract: The emergence of multicore architectures will lead to an increase in the use of multithreaded applications that are prone to synchronization bugs, such as data races. Software solutions for detecting data races generally incur large overheads. Hardware support for race detection can significantly reduce that overhead. However, all existing hardware proposals for race detection are based on the happens-before algorithm, which is sensitive to thread interleaving and cannot detect races that are not exposed during the monitored run. The lockset algorithm addresses this limitation. Unfortunately, due to challenging issues such as storing the lockset information and performing complex set operations, it has so far been implemented only in software, with a 10-30x performance hit. This paper proposes the first hardware implementation (called HARD) of the lockset algorithm, to exploit the race detection capability of this algorithm with minimal overhead. HARD efficiently stores lock sets in hardware Bloom filters and converts the expensive set operations into fast bitwise logic operations with negligible overhead. We evaluate HARD using six SPLASH-2 applications with 60 randomly injected bugs. Our results show that HARD can detect 54 of the 60 tested bugs, 20% more than happens-before, with only 0.1-2.6% execution overhead. We also show that our hardware design is cost-effective by comparing it with an ideal lockset implementation, which would require a large amount of hardware resources.
Citations: 163
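The key trick named in the abstract, lock sets stored as Bloom filters so that set intersection becomes a single bitwise AND, can be sketched as follows. The filter size and the two hash functions are illustrative assumptions, not HARD's actual parameters:

```python
# Lockset-as-Bloom-filter sketch. A lock set is a fixed-width bit
# vector; intersection (the lockset algorithm's core operation) is
# one AND. Hashes and width are made-up, not HARD's design.

FILTER_BITS = 64

def bloom_add(sig, lock_id):
    """Fold a lock id into a signature via two cheap hash functions."""
    h1 = lock_id % FILTER_BITS
    h2 = (lock_id * 31 + 7) % FILTER_BITS
    return sig | (1 << h1) | (1 << h2)

def refine(candidate_sig, held_sig):
    """Lockset refinement: intersect a shared variable's candidate
    lock set with the locks held by the accessing thread. An empty
    (zero) result flags a potential race; Bloom encoding can only
    add false positives, never miss a common lock."""
    return candidate_sig & held_sig
```

Here the 10-30x software cost of maintaining and intersecting explicit sets collapses into OR on acquire and AND on access, which is what makes a hardware implementation cheap.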
Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346193
A. Meixner, Daniel J. Sorin
Abstract: To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explored dynamic checking of cache coherence as a comprehensive approach to memory system error detection. However, existing coherence checkers are costly to implement, incur high interconnection network traffic overhead, and do not scale well. In this paper, we describe the token coherence signature checker (TCSC), which provides comprehensive, low-cost, scalable coherence checking by maintaining signatures that represent recent histories of coherence events at all nodes (cache and memory controllers). Periodically, these signatures are sent to a verifier to determine if an error occurred. TCSC has a small constant hardware cost per node, independent of cache and memory size and the number of nodes. TCSC's interconnect bandwidth overhead has a constant upper bound and never exceeds 7% in our experiments. TCSC has negligible impact on system performance.
Citations: 34
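The overall shape of such a checker, per-node signatures of coherence events combined at a verifier, can be sketched around token conservation (in token coherence, each block's tokens are fixed in number, so transfers must be zero-sum). The additive hash below is an illustrative assumption, not TCSC's actual signature encoding:

```python
# Sketch in the spirit of signature-based coherence checking: each
# node folds its (block address, token delta) events into a small
# additive signature; a verifier checks the combined signature is 0.
# Hash and modulus are illustrative, not TCSC's design.

MOD = 2**32

def fold(signature, block_addr, token_delta):
    """Accumulate one coherence event into a node's signature.
    Weighting by a per-address hash makes it unlikely that errors
    on different blocks cancel each other out."""
    weight = (block_addr * 2654435761) % MOD  # Knuth multiplicative hash
    return (signature + weight * token_delta) % MOD

def verify(node_signatures):
    """Token transfers are zero-sum across nodes, so a correct
    history of events combines to zero (with high probability)."""
    return sum(node_signatures) % MOD == 0
```

Only the fixed-width signatures travel to the verifier, which matches the abstract's point about constant per-node cost and bounded interconnect overhead.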
Colorama: Architectural Support for Data-Centric Synchronization
2007 IEEE 13th International Symposium on High Performance Computer Architecture · Pub Date: 2007-02-10 · DOI: 10.1109/HPCA.2007.346192
L. Ceze, Pablo Montesinos, C. V. Praun, J. Torrellas
Abstract: With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs nonlocal reasoning to decide where to place transactions in the code. A significant improvement to the art is data-centric synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations. This paper proposes novel architectural support to make DCS feasible, and describes its programming model and interface. The proposal, called Colorama, needs only modest hardware extensions, supports general-purpose, pointer-based languages such as C/C++, and, in our opinion, can substantially simplify the task of writing new parallel programs.
Citations: 44