{"title":"StimulusCache: Boosting performance of chip multiprocessors with excess cache","authors":"Hyunjin Lee, Sangyeun Cho, B. Childers","doi":"10.1109/HPCA.2010.5416644","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416644","url":null,"abstract":"Technology advances continuously shrink on-chip devices. Consequently, the number of cores in a single chip multiprocessor (CMP) is expected to grow in coming years. Unfortunately, with smaller device size and greater integration, chip yield degrades significantly. Guaranteeing that all chip components function correctly leads to an unrealistically low yield. Chip vendors have adopted a design strategy to market partially functioning processor chips to combat this problem. The two major components in a multicore chip are compute cores and on-chip memory such as L2 cache. From the viewpoint of the chip yield, the compute cores have a much lower yield than the on-chip memory due to their logic complexity and well-established memory yield enhancing techniques. Therefore, future CMPs are expected to have more available on-chip memories than working cores. This paper introduces a novel on-chip memory utilization scheme called StimulusCache, which decouples the L2 caches of faulty compute cores and employs them to assist applications on other working cores. Our extensive experimental evaluation demonstrates that StimulusCache significantly improves the performance of both single-threaded and multithreaded workloads.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123827883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Value Based BTB Indexing for indirect jump prediction","authors":"M. U. Farooq, Lei Chen, L. John","doi":"10.1109/HPCA.2010.5416659","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416659","url":null,"abstract":"History-based branch direction predictors for conditional branches are shown to be highly accurate. Indirect branches however, are hard to predict as they may have multiple targets corresponding to a single indirect branch instruction. We propose the Value Based BTB Indexing (VBBI), a correlation-based target address prediction scheme for indirect jump instructions. For each static hard-to-predict indirect jump instruction, the compiler identifies a ‘hint instruction’, whose output value strongly correlates with the target address of the indirect jump instruction. At run time, multiple target addresses of the indirect jump instruction are stored and subsequently accessed from the BTB at different indices computed using the jump instruction PC and the hint instruction output values. In case the hint instruction has not finished its execution when the jump instruction is fetched, a second and more accurate target address prediction is made when the hint instruction output is available, thus reducing the jump misprediction penalty. We compare our design to the regular BTB design and the best previously proposed indirect jump predictor, the tagged target cache (TTC). Our evaluation shows that the VBBI scheme improves the indirect jump target prediction accuracy by 48% and 18%, compared with the baseline BTB and TTC designs, respectively. This results in average performance improvement of 16.4% over the baseline BTB scheme, and 13% improvement over the TTC predictor. Out of this performance improvement 2% is contributed by target prediction overriding which is accurate 96% of the time.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134259307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operating system support for overlapping-ISA heterogeneous multi-core architectures","authors":"Tong Li, P. Brett, Rob C. Knauerhase, David A. Koufaty, D. Reddy, Scott Hahn","doi":"10.1109/HPCA.2010.5416660","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416660","url":null,"abstract":"A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective solution for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. This design, however, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical instruction sets. Our algorithms allow applications to transparently execute and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous platform. Evaluation results demonstrate that our designs efficiently manage heterogeneous hardware and enable significant performance improvements for a range of applications.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133310848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a processor from the ground up to allow voltage/reliability tradeoffs","authors":"A. Kahng, Seokhyeong Kang, Rakesh Kumar, J. Sartori","doi":"10.1109/HPCA.2010.5416652","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416652","url":null,"abstract":"Current processor designs have a critical operating point that sets a hard limit on voltage scaling. Any scaling beyond the critical voltage results in exceeding the maximum allowable error rate, i.e., there are more timing errors than can be effectively and gainfully detected or corrected by an error-tolerance mechanism. This limits the effectiveness of voltage scaling as a knob for reliability/power tradeoffs. In this paper, we present power-aware slack redistribution, a novel design-level approach to allow voltage/reliability tradeoffs in processors. Techniques based on power-aware slack redistribution reapportion timing slack of the frequently-occurring, near-critical timing paths of a processor in a power- and area-efficient manner, such that we increase the range of voltages over which the incidence of operational (timing) errors is acceptable. This results in soft architectures — designs that fail gracefully, allowing us to perform reliability/power tradeoffs by reducing voltage up to the point that produces maximum allowable errors for our application. The goal of our optimization is to minimize the voltage at which a soft architecture encounters the maximum allowable error rate, thus maximizing the range over which voltage scaling is possible and minimizing power consumption for a given error rate. Our experiments demonstrate 23% power savings over the baseline design at an error rate of 1%. Observed power reductions are 29%, 29%, 19%, and 20% for error rates of 2%, 4%, 8%, and 16% respectively. Benefits are higher in the face of error recovery using Razor. Area overhead of our techniques is up to 2.7%.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129623779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HARE: Hardware assisted reverse execution","authors":"Ioannis Doudalis, Milos Prvulović","doi":"10.1109/HPCA.2010.5416651","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416651","url":null,"abstract":"Bidirectional execution is a powerful debugging technique that allows program execution to proceed both forward and in reverse. Many software-only techniques and tools have emerged that use checkpointing and replay to provide the effect of reverse execution, although with considerable performance overheads in both forward and reverse execution. Recent hardware proposals for checkpointing and execution replay minimize these performance overheads, but in a way that prevents checkpoint consolidation, a key technique for reducing memory use while retaining the ability to reverse long periods of execution. This paper presents HARE, a hardware technique that efficiently supports both checkpointing and consolidation. Our experiments show that on average HARE incurs <3% performace overheads even when creating tens of checkpoints per second, provides reverse execution times similar to forward execution times, and reduces the total space used by checkpoints by a factor of 36 on average (this factor gets better for longer runs) relative to prior consolidation-less hardware checkpointing schemes.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116740997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing","authors":"Moinuddin K. Qureshi, M. Franceschini, L. A. Lastras-Montaño","doi":"10.1109/HPCA.2010.5416645","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416645","url":null,"abstract":"Phase Change Memory (PCM) is emerging as a promising technology to build large-scale main memory systems in a cost-effective manner. A characteristic of PCM is that it has write latency much higher than read latency. A higher write latency can typically be tolerated using buffers. However, once a write request is scheduled for service to a bank, it can still cause increased latency for later arriving read requests to the same bank. We show that for the baseline PCM system with read-priority scheduling, the write requests increase the effective read latency to 2.3x (on average), causing significant performance degradation. To reduce the read latency of PCM devices under such scenarios, we propose adaptive Write Cancellation policies. Such policies can abort the processing of a scheduled write requests if a read request arrives to the same bank within a predetermined period. We also propose Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads. For the baseline system, the proposed technique removes 75% of the latency increase incurred by read requests and improves overall system performance by 46% (on average), while requiring negligible hardware and simple extensions to PCM controller.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124950855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IADVS: On-demand performance for interactive applications","authors":"Mingsong Bi, Igor Crk, C. Gniady","doi":"10.1109/HPCA.2010.5416649","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416649","url":null,"abstract":"Increasingly power-hungry processors have reinforced the need for aggressive power management. Dynamic voltage scaling has become a common design consideration allowing for energy efficient CPUs by matching CPU performance with the computational demand of running processes. In this paper, we propose Interaction-Aware Dynamic Voltage Scaling (IADVS), a novel fine-grained approach to managing CPU power during interactive workloads, which account for the bulk of the processing demand on modern mobile or desktop systems. IADVS is built upon a transparent, fine-grained interaction capture system. Able to track CPU usage for each user interface event, the proposed system sets the CPU performance level to the one that best matches the predicted CPU demand. Compared to the state-of-the-art approach of user-interaction-based CPU energy management, we show that IADVS improves prediction accuracy by 37%, reduces processing delays by 17%, and reduces energy consumed of the CPU by as much as 4%. The proposed design is evaluated with both a detailed trace-based simulation as well as implementation on a real system, verifying the simulation findings.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121572476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all","authors":"Bogdan F. Romanescu, A. Lebeck, Daniel J. Sorin, Anne Bracy","doi":"10.1109/HPCA.2010.5416643","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416643","url":null,"abstract":"We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data caches, without requiring any changes to the existing coherence protocol. UNITD eliminates the need for the software TLB shootdown routine, a procedure known to be performance costly and non-scalable. We evaluate snooping and directory UNITD coherence protocols on multicore processors with 2–16 cores, and we demonstrate that UNITD reduces the performance penalty associated with TLB coherence to almost zero.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124228856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COMIC++: A software SVM system for heterogeneous multicore accelerator clusters","authors":"Jaejin Lee, Jun Lee, Sangmin Seo, Jungwon Kim, Seungkyun Kim, Zehra Sura","doi":"10.1109/HPCA.2010.5416633","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416633","url":null,"abstract":"In this paper, we propose a software shared virtual memory (SVM) system for heterogeneous multicore accelerator clusters with explicitly managed memory hierarchies. The target cluster consists of a single manager node and many compute nodes. The manager node contains a generalpurpose processor and larger main memory, and each compute node contains a heterogeneous multicore processor and smaller main memory. These nodes are connected with an interconnection network, such as Gigabit Ethernet. The heterogeneous multicore processor in each compute node consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE runs an OS and the multiple APEs are dedicated to compute-intensive workloads. The GPE is typically backed by a deep on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small explicitly-addressed local memory instead of caches. This APE local memory is not coherent with the main memory. Different main and local memory units in the accelerator cluster can be viewed as an explicitly managed memory hierarchy: global memory, node local memory, and APE local memory. Since coherence protocols of previous software SVM proposals cannot effectively handle such a memory hierarchy, we propose a new coherence and consistency protocol, called hierarchical centralized release consistency (HCRC). Our software SVM system is built on top of HCRC and software-managed caches. We evaluate the effectiveness and analyze the performance of our software SVM system on a 32-node heterogeneous multicore cluster (a total of 192 APEs).","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115339477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores","authors":"Brian Greskamp, Ulya R. Karpuzcu, J. Torrellas","doi":"10.1109/HPCA.2010.5416656","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416656","url":null,"abstract":"Despite the ubiquity of multicores, it is as important as ever to deliver high single-thread performance. An appealing way to accomplish this is by shutting down the idle cores in the chip and running the busy, performance-critical core(s) at higher-than-nominal frequencies. To enable such frequencies, two low-overhead approaches either boost voltage beyond nominal values, or pair cores in leader-checker configurations and let them run beyond safe frequency margins. We observe that, in a large multicore with varying numbers of busy cores, individual application of either of these two techniques is suboptimal. Each alone is often unable to bring the multicore all the way to its power or temperature envelopes due to limitations in supply voltage or error rate. Moreover, we show that the two techniques are complementary, and can be synergistically combined to unlock much higher levels of single-thread performance. Finally, we demonstrate a dynamic controller that optimizes the two techniques. Our data shows that, given a 16-core multi-core where half of the cores are already busy, an additional, performance-critical thread now attains 34% higher performance than before, while consuming 220% more power.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122010082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}