{"title":"StimulusCache: Boosting performance of chip multiprocessors with excess cache","authors":"Hyunjin Lee, Sangyeun Cho, B. Childers","doi":"10.1109/HPCA.2010.5416644","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416644","url":null,"abstract":"Technology advances continuously shrink on-chip devices. Consequently, the number of cores in a single chip multiprocessor (CMP) is expected to grow in coming years. Unfortunately, with smaller device size and greater integration, chip yield degrades significantly. Guaranteeing that all chip components function correctly leads to an unrealistically low yield. Chip vendors have adopted a design strategy to market partially functioning processor chips to combat this problem. The two major components in a multicore chip are compute cores and on-chip memory such as L2 cache. From the viewpoint of the chip yield, the compute cores have a much lower yield than the on-chip memory due to their logic complexity and well-established memory yield enhancing techniques. Therefore, future CMPs are expected to have more available on-chip memories than working cores. This paper introduces a novel on-chip memory utilization scheme called StimulusCache, which decouples the L2 caches of faulty compute cores and employs them to assist applications on other working cores. Our extensive experimental evaluation demonstrates that StimulusCache significantly improves the performance of both single-threaded and multithreaded workloads.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123827883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Value Based BTB Indexing for indirect jump prediction","authors":"M. U. Farooq, Lei Chen, L. John","doi":"10.1109/HPCA.2010.5416659","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416659","url":null,"abstract":"History-based branch direction predictors for conditional branches are shown to be highly accurate. Indirect branches however, are hard to predict as they may have multiple targets corresponding to a single indirect branch instruction. We propose the Value Based BTB Indexing (VBBI), a correlation-based target address prediction scheme for indirect jump instructions. For each static hard-to-predict indirect jump instruction, the compiler identifies a ‘hint instruction’, whose output value strongly correlates with the target address of the indirect jump instruction. At run time, multiple target addresses of the indirect jump instruction are stored and subsequently accessed from the BTB at different indices computed using the jump instruction PC and the hint instruction output values. In case the hint instruction has not finished its execution when the jump instruction is fetched, a second and more accurate target address prediction is made when the hint instruction output is available, thus reducing the jump misprediction penalty. We compare our design to the regular BTB design and the best previously proposed indirect jump predictor, the tagged target cache (TTC). Our evaluation shows that the VBBI scheme improves the indirect jump target prediction accuracy by 48% and 18%, compared with the baseline BTB and TTC designs, respectively. This results in average performance improvement of 16.4% over the baseline BTB scheme, and 13% improvement over the TTC predictor. Out of this performance improvement 2% is contributed by target prediction overriding which is accurate 96% of the time.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134259307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operating system support for overlapping-ISA heterogeneous multi-core architectures","authors":"Tong Li, P. Brett, Rob C. Knauerhase, David A. Koufaty, D. Reddy, Scott Hahn","doi":"10.1109/HPCA.2010.5416660","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416660","url":null,"abstract":"A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective solution for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. This design, however, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical instruction sets. Our algorithms allow applications to transparently execute and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous platform. Evaluation results demonstrate that our designs efficiently manage heterogeneous hardware and enable significant performance improvements for a range of applications.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133310848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a processor from the ground up to allow voltage/reliability tradeoffs","authors":"A. Kahng, Seokhyeong Kang, Rakesh Kumar, J. Sartori","doi":"10.1109/HPCA.2010.5416652","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416652","url":null,"abstract":"Current processor designs have a critical operating point that sets a hard limit on voltage scaling. Any scaling beyond the critical voltage results in exceeding the maximum allowable error rate, i.e., there are more timing errors than can be effectively and gainfully detected or corrected by an error-tolerance mechanism. This limits the effectiveness of voltage scaling as a knob for reliability/power tradeoffs. In this paper, we present power-aware slack redistribution, a novel design-level approach to allow voltage/reliability tradeoffs in processors. Techniques based on power-aware slack redistribution reapportion timing slack of the frequently-occurring, near-critical timing paths of a processor in a power- and area-efficient manner, such that we increase the range of voltages over which the incidence of operational (timing) errors is acceptable. This results in soft architectures — designs that fail gracefully, allowing us to perform reliability/power tradeoffs by reducing voltage up to the point that produces maximum allowable errors for our application. The goal of our optimization is to minimize the voltage at which a soft architecture encounters the maximum allowable error rate, thus maximizing the range over which voltage scaling is possible and minimizing power consumption for a given error rate. Our experiments demonstrate 23% power savings over the baseline design at an error rate of 1%. Observed power reductions are 29%, 29%, 19%, and 20% for error rates of 2%, 4%, 8%, and 16% respectively. Benefits are higher in the face of error recovery using Razor. Area overhead of our techniques is up to 2.7%.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129623779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HARE: Hardware assisted reverse execution","authors":"Ioannis Doudalis, Milos Prvulović","doi":"10.1109/HPCA.2010.5416651","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416651","url":null,"abstract":"Bidirectional execution is a powerful debugging technique that allows program execution to proceed both forward and in reverse. Many software-only techniques and tools have emerged that use checkpointing and replay to provide the effect of reverse execution, although with considerable performance overheads in both forward and reverse execution. Recent hardware proposals for checkpointing and execution replay minimize these performance overheads, but in a way that prevents checkpoint consolidation, a key technique for reducing memory use while retaining the ability to reverse long periods of execution. This paper presents HARE, a hardware technique that efficiently supports both checkpointing and consolidation. Our experiments show that on average HARE incurs <3% performace overheads even when creating tens of checkpoints per second, provides reverse execution times similar to forward execution times, and reduces the total space used by checkpoints by a factor of 36 on average (this factor gets better for longer runs) relative to prior consolidation-less hardware checkpointing schemes.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116740997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing","authors":"Moinuddin K. Qureshi, M. Franceschini, L. A. Lastras-Montaño","doi":"10.1109/HPCA.2010.5416645","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416645","url":null,"abstract":"Phase Change Memory (PCM) is emerging as a promising technology to build large-scale main memory systems in a cost-effective manner. A characteristic of PCM is that it has write latency much higher than read latency. A higher write latency can typically be tolerated using buffers. However, once a write request is scheduled for service to a bank, it can still cause increased latency for later arriving read requests to the same bank. We show that for the baseline PCM system with read-priority scheduling, the write requests increase the effective read latency to 2.3x (on average), causing significant performance degradation. To reduce the read latency of PCM devices under such scenarios, we propose adaptive Write Cancellation policies. Such policies can abort the processing of a scheduled write requests if a read request arrives to the same bank within a predetermined period. We also propose Write Pausing, which exploits the iterative write algorithms used in PCM to pause at the end of each write iteration to service any pending reads. For the baseline system, the proposed technique removes 75% of the latency increase incurred by read requests and improves overall system performance by 46% (on average), while requiring negligible hardware and simple extensions to PCM controller.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124950855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IADVS: On-demand performance for interactive applications","authors":"Mingsong Bi, Igor Crk, C. Gniady","doi":"10.1109/HPCA.2010.5416649","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416649","url":null,"abstract":"Increasingly power-hungry processors have reinforced the need for aggressive power management. Dynamic voltage scaling has become a common design consideration allowing for energy efficient CPUs by matching CPU performance with the computational demand of running processes. In this paper, we propose Interaction-Aware Dynamic Voltage Scaling (IADVS), a novel fine-grained approach to managing CPU power during interactive workloads, which account for the bulk of the processing demand on modern mobile or desktop systems. IADVS is built upon a transparent, fine-grained interaction capture system. Able to track CPU usage for each user interface event, the proposed system sets the CPU performance level to the one that best matches the predicted CPU demand. Compared to the state-of-the-art approach of user-interaction-based CPU energy management, we show that IADVS improves prediction accuracy by 37%, reduces processing delays by 17%, and reduces energy consumed of the CPU by as much as 4%. The proposed design is evaluated with both a detailed trace-based simulation as well as implementation on a real system, verifying the simulation findings.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121572476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all","authors":"Bogdan F. Romanescu, A. Lebeck, Daniel J. Sorin, Anne Bracy","doi":"10.1109/HPCA.2010.5416643","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416643","url":null,"abstract":"We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data caches, without requiring any changes to the existing coherence protocol. UNITD eliminates the need for the software TLB shootdown routine, a procedure known to be performance costly and non-scalable. We evaluate snooping and directory UNITD coherence protocols on multicore processors with 2–16 cores, and we demonstrate that UNITD reduces the performance penalty associated with TLB coherence to almost zero.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124228856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COMIC++: A software SVM system for heterogeneous multicore accelerator clusters","authors":"Jaejin Lee, Jun Lee, Sangmin Seo, Jungwon Kim, Seungkyun Kim, Zehra Sura","doi":"10.1109/HPCA.2010.5416633","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416633","url":null,"abstract":"In this paper, we propose a software shared virtual memory (SVM) system for heterogeneous multicore accelerator clusters with explicitly managed memory hierarchies. The target cluster consists of a single manager node and many compute nodes. The manager node contains a generalpurpose processor and larger main memory, and each compute node contains a heterogeneous multicore processor and smaller main memory. These nodes are connected with an interconnection network, such as Gigabit Ethernet. The heterogeneous multicore processor in each compute node consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The GPE runs an OS and the multiple APEs are dedicated to compute-intensive workloads. The GPE is typically backed by a deep on-chip cache hierarchy and hardware cache coherence. On the other hand, the APEs have small explicitly-addressed local memory instead of caches. This APE local memory is not coherent with the main memory. Different main and local memory units in the accelerator cluster can be viewed as an explicitly managed memory hierarchy: global memory, node local memory, and APE local memory. Since coherence protocols of previous software SVM proposals cannot effectively handle such a memory hierarchy, we propose a new coherence and consistency protocol, called hierarchical centralized release consistency (HCRC). Our software SVM system is built on top of HCRC and software-managed caches. We evaluate the effectiveness and analyze the performance of our software SVM system on a 32-node heterogeneous multicore cluster (a total of 192 APEs).","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115339477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LeadOut: Composing low-overhead frequency-enhancing techniques for single-thread performance in configurable multicores","authors":"Brian Greskamp, Ulya R. Karpuzcu, J. Torrellas","doi":"10.1109/HPCA.2010.5416656","DOIUrl":"https://doi.org/10.1109/HPCA.2010.5416656","url":null,"abstract":"Despite the ubiquity of multicores, it is as important as ever to deliver high single-thread performance. An appealing way to accomplish this is by shutting down the idle cores in the chip and running the busy, performance-critical core(s) at higher-than-nominal frequencies. To enable such frequencies, two low-overhead approaches either boost voltage beyond nominal values, or pair cores in leader-checker configurations and let them run beyond safe frequency margins. We observe that, in a large multicore with varying numbers of busy cores, individual application of either of these two techniques is suboptimal. Each alone is often unable to bring the multicore all the way to its power or temperature envelopes due to limitations in supply voltage or error rate. Moreover, we show that the two techniques are complementary, and can be synergistically combined to unlock much higher levels of single-thread performance. Finally, we demonstrate a dynamic controller that optimizes the two techniques. Our data shows that, given a 16-core multi-core where half of the cores are already busy, an additional, performance-critical thread now attains 34% higher performance than before, while consuming 220% more power.","PeriodicalId":368621,"journal":{"name":"HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122010082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}