Proceedings Eighth International Symposium on High Performance Computer Architecture: Latest Publications

Power issues related to branch prediction
Dharmesh Parikh, K. Skadron, Yan Zhang, M. Barcella, M. Stan
DOI: 10.1109/HPCA.2002.995713
Abstract: This paper explores the role of branch predictor organization in power/energy/performance tradeoffs for processor design. We find that as a general rule, to reduce overall energy consumption in the processor it is worthwhile to spend more power in the branch predictor if this results in more accurate predictions that improve running time. Two techniques, however, provide substantial reductions in power dissipation without harming accuracy. Banking reduces the portion of the branch predictor that is active at any one time. And a new on-chip structure, the prediction probe detector (PPD), can use pre-decode bits to entirely eliminate unnecessary predictor and branch target buffer (BTB) accesses. Despite the extra power that must be spent accessing the PPD, it reduces local predictor power and energy dissipation by about 45% and overall processor power and energy dissipation by 5-6%.
Citations: 128
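The prediction probe detector described above amounts to a small pre-decode table consulted before the predictor: when the fetched block is known to contain no branch, the predictor and BTB reads are skipped entirely. A minimal sketch of that filtering idea, with a hypothetical access stream and table (not the paper's actual structure):

```python
# Hypothetical sketch of a prediction probe detector (PPD): pre-decode
# bits recorded per fetch block let the front end skip branch predictor
# and BTB accesses for blocks known to contain no branches.

def run_fetch_stream(stream, predecode):
    """stream: fetch-block addresses; predecode: addr -> True if the
    block contains a branch. Returns (predictor_accesses, skipped)."""
    predictor_accesses = 0
    skipped = 0
    for addr in stream:
        if predecode.get(addr, True):   # unknown blocks are probed conservatively
            predictor_accesses += 1     # predictor + BTB are read
        else:
            skipped += 1                # the PPD filtered the access entirely
    return predictor_accesses, skipped

predecode = {0x100: True, 0x140: False, 0x180: False}
stream = [0x100, 0x140, 0x180, 0x140, 0x1c0]
print(run_fetch_stream(stream, predecode))  # -> (2, 3)
```

The energy argument in the abstract is exactly this trade: the PPD itself costs a small lookup per fetch, but each skipped access saves a read of the much larger predictor and BTB arrays.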
The minimax cache: an energy-efficient framework for media processors
O. Unsal, I. Koren, C. M. Krishna, C. A. Moritz
DOI: 10.1109/HPCA.2002.995704
Abstract: This work is based on our philosophy of providing interlayer system-level power awareness in computing systems. Here, we couple this approach with our vision of multi-partitioned memory systems, where memory accesses are separated based on their static predictability and memory footprint and managed with various compiler controlled techniques. We show that media applications are mapped more efficiently when scalar memory accesses are redirected to a mini-cache. Our results indicate that a partitioned 8K cache with the scalars being mapped to a 512 byte mini-cache can be more efficient than a 16K monolithic cache from both performance and energy point of view for most applications. In extensive experiments, we report 30% to 60% energy-delay product savings over a range of system configurations and different cache sizes.
Citations: 36
Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management
K. Skadron, T. Abdelzaher, M. Stan
DOI: 10.1109/HPCA.2002.995695
Abstract: This paper proposes the use of formal feedback control theory as a way to implement adaptive techniques in the processor architecture. Dynamic thermal management (DTM) is used as a test vehicle, and variations of a PID controller (Proportional-Integral-Differential) are developed and tested for adaptive control of fetch "toggling." To accurately test the DTM mechanism being proposed, this paper also develops a thermal model based on lumped thermal resistances and thermal capacitances. This model is computationally efficient and tracks temperature at the granularity of individual functional blocks within the processor. Because localized heating occurs much faster than chip-wide heating, some parts of the processor are more likely to be "hot spots" than others. Experiments using Wattch and the SPEC2000 benchmarks show that the thermal trigger threshold can be set within 0.2° of the maximum temperature and yet never enter thermal emergency. This cuts the performance loss of DTM by 65% compared to the previously described fetch toggling technique that uses a response of fixed magnitude.
Citations: 426
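The two ingredients named in the abstract — a lumped thermal-RC block model (C·dT/dt = P − (T − T_amb)/R) and a PID controller driving the fetch duty cycle — can be sketched numerically. All constants below (R, C, the PID gains, the power model) are hypothetical tunings chosen only to make the feedback loop visible, not the paper's values:

```python
# Sketch of PID-controlled fetch toggling over a lumped thermal-RC model:
# the block obeys C*dT/dt = P - (T - T_amb)/R, and the controller scales
# the fetch duty cycle (hence power) to hold temperature at a setpoint.
# All constants are hypothetical tunings for illustration.

def simulate(steps=5000, dt=0.001):
    R, C = 0.8, 0.05                 # thermal resistance (K/W) and capacitance (J/K)
    T_amb, setpoint = 45.0, 80.0
    T = T_amb
    P_max = 60.0                     # power at full fetch duty cycle (W)
    kp, ki, kd = 0.5, 2.0, 0.0005    # PID gains (hypothetical)
    integral, prev_err = 0.0, 0.0
    for _ in range(steps):
        err = setpoint - T
        deriv = (err - prev_err) / dt
        prev_err = err
        u = kp * err + ki * integral + kd * deriv
        duty = min(1.0, max(0.0, u))
        if u == duty:                # simple anti-windup: freeze I while saturated
            integral += err * dt
        P = duty * P_max
        T += dt * (P - (T - T_amb) / R) / C   # lumped thermal-RC update
    return T
```

Running `simulate()` settles the block temperature close to the 80° setpoint. The paper's point is that a controller like this lets the trigger threshold sit much nearer the emergency temperature than a fixed-magnitude toggling response can.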
Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling
Greg Semeraro, G. Magklis, R. Balasubramonian, D. Albonesi, S. Dwarkadas, M. Scott
DOI: 10.1109/HPCA.2002.995696
Abstract: As clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a multiple clock domain (MCD) processor, in which the chip is divided into several clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end, integer units, floating point units, and load-store units. We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use off-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD compared to a modest 3% savings from voltage scaling a single clock and voltage system.
Citations: 401
Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay
Se-Hyun Yang, Michael D. Powell, B. Falsafi, T. N. Vijaykumar
DOI: 10.1109/HPCA.2002.995706
Abstract: Cache memories account for a significant fraction of a chip's overall energy dissipation. Recent research advocates using "resizable" caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache's unused sections with minimal impact on performance. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache's set-associativity, while the other, referred to as selective-sets, varies the number of cache sets, and (2) resizing strategy, where one proposal statically sets the cache size prior to an application's execution, while the other allows for dynamic resizing both within and across applications. In this paper, we compare and contrast, for the first time, the proposed design choices for resizable caches, and evaluate the effectiveness of cache resizings in reducing the overall energy-delay in deep-submicron processors. In addition, we propose a hybrid selective-sets-and-ways cache organization that always offers equal or better resizing granularity than both previously proposed organizations. We also investigate the energy savings from resizing d-cache and i-cache together to characterize the interaction between d-cache and i-cache resizings.
Citations: 149
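The granularity argument in the abstract is easy to make concrete: selective-ways can only reach sizes that are multiples of one way, selective-sets only power-of-two fractions of the base size, and the hybrid reaches every combination of the two. A sketch enumerating the reachable sizes for a hypothetical 64 KB, 4-way base cache (the paper's actual configurations may differ):

```python
# Reachable cache sizes (in KB) for a hypothetical 64 KB, 4-way base cache.
# selective-ways disables ways; selective-sets halves the number of sets;
# the hybrid combines both, giving finer resizing granularity than either.

BASE_KB, WAYS, SET_STEPS = 64, 4, 3   # sets can be halved SET_STEPS times

def selective_ways():
    return {BASE_KB * w // WAYS for w in range(1, WAYS + 1)}

def selective_sets():
    return {BASE_KB // (2 ** s) for s in range(SET_STEPS + 1)}

def hybrid():
    return {BASE_KB * w // (WAYS * 2 ** s)
            for w in range(1, WAYS + 1) for s in range(SET_STEPS + 1)}

print(sorted(selective_ways()))  # [16, 32, 48, 64]
print(sorted(selective_sets()))  # [8, 16, 32, 64]
print(sorted(hybrid()))          # [2, 4, 6, 8, 12, 16, 24, 32, 48, 64]
```

Either baseline offers four sizes here; the hybrid offers ten, and always includes every size either baseline can reach, which is the "equal or better resizing granularity" claim.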
Improving value communication for thread-level speculation
J. Steffan, Christopher B. Colohan, Antonia Zhai, T. Mowry
DOI: 10.1109/HPCA.2002.995699
Abstract: Thread-level speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this paper, we show that the key to good performance lies in the three different ways to communicate a value between speculative threads: speculation, synchronization and prediction. The difficult part is deciding how and when to apply each method. This paper shows how we can apply value prediction, dynamic synchronization and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS. We find that value prediction can be effective when properly throttled to avoid the high costs of mis-prediction, while most of the gains of value prediction can be more easily achieved by exploiting silent stores. We also show that dynamic synchronization is quite effective for most benchmarks, while hardware instruction prioritization is not. Overall, we find that these techniques have great potential for improving the performance of TLS.
Citations: 121
User-level communication in cluster-based servers
E. V. Carrera, S. Rao, L. Iftode, R. Bianchini
DOI: 10.1109/HPCA.2002.995717
Abstract: Clusters of commodity computers are currently being used to provide the scalability required by several popular Internet services. In this paper we evaluate an efficient cluster-based WWW server, as a function of the characteristics of the intra-cluster communication architecture. More specifically, we evaluate the impact of processor overhead, network bandwidth, remote memory writes, and zero-copy data transfers on the performance of our server. Our experimental results with an 8-node cluster and four real WWW traces show that network bandwidth affects the performance of our server by only 6%. In contrast, user-level communication can improve performance by as much as 29%. Low processor overhead, remote memory writes, and zero-copy all make small contributions towards this overall gain. To be able to extrapolate from our experimental results, we use an analytical model to assess the performance of our server under different workload characteristics, different numbers of cluster nodes, and higher performance systems. Our modeling results show that higher gains (of up to 55%) can be accrued for workloads with large working sets and next-generation servers running on large clusters.
Citations: 51
The FAB predictor: using Fourier analysis to predict the outcome of conditional branches
Martin Kämpe, P. Stenström, M. Dubois
DOI: 10.1109/HPCA.2002.995712
Abstract: This paper proposes to transform the branch outcome history from the time domain to the frequency domain. With our proposed Fourier Analysis Branch (FAB) predictor, we can represent long periodic branch history patterns - as long as 2^13 bits - with a realistic number of bits (52 bits). We evaluate the potential gains of the FAB predictor by considering a hybrid branch predictor in which each branch is predicted using a static scheme, the 2-bit dynamic scheme, the PAp and GAp schemes, and our FAB predictor. By including our FAB predictor in the hybrid predictor, it is possible to cut the misprediction rate of integer applications in the SPEC95 suite by between 5 and 50% with an average of 20%. Besides evaluating its performance, this paper shows some key properties of our FAB predictor and presents some possible implementation approaches.
Citations: 10
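The core idea — moving the outcome history into the frequency domain so a long periodic pattern becomes a few dominant coefficients — can be sketched with a plain DFT: map taken/not-taken to ±1, find the strongest frequency, and predict the next outcome from one period earlier. A pure-Python sketch on a hypothetical history (the real FAB predictor's encoding and hardware are far more compact):

```python
import cmath

# Sketch of frequency-domain branch prediction in the spirit of the FAB
# predictor: map outcomes to +/-1, take a DFT, recover the dominant
# period, and predict the next outcome from one period back.

def dominant_period(history):
    n = len(history)
    x = [1.0 if b else -1.0 for b in history]
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):          # skip DC; positive frequencies only
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return round(n / best_k)                # period, in branch outcomes

def predict_next(history):
    p = dominant_period(history)
    return history[len(history) - p]        # repeat the pattern one period back

# A branch that is taken twice then not taken, repeating with period 3.
hist = [1, 1, 0] * 8
print(dominant_period(hist))                # -> 3
print(predict_next(hist))                   # -> 1
```

On a perfectly periodic history the dominant bin recovers the period exactly; the paper's contribution is doing this with tens of bits of state rather than an explicit 2^13-bit history.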
A new memory monitoring scheme for memory-aware scheduling and partitioning
G. Suh, S. Devadas, L. Rudolph
DOI: 10.1109/HPCA.2002.995703
Abstract: We propose a low overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to improve scheduling and partitioning schemes.
Citations: 324
Microarchitectural simulation and control of di/dt-induced power supply voltage variation
Edward T. Grochowski, D. Ayers, V. Tiwari
DOI: 10.1109/HPCA.2002.995694
Abstract: As the power consumption of modern high-performance microprocessors increases beyond 100 W, power becomes an increasingly important design consideration. This paper presents a novel technique to simulate power supply voltage variation as a result of varying activity levels within the microprocessor when executing typical software. The voltage simulation capability may be added to existing microarchitecture simulators that determine the activities of each functional block on a clock-by-clock basis. We then discuss how the same technique can be implemented in logic on the microprocessor die to enable real-time computation of current consumption and power supply voltage. When used in a feedback loop, this logic makes it possible to control the microprocessor's activities to reduce demands on the power delivery system. With on-die voltage computation and di/dt control, we show that a significant reduction in power supply voltage variation may be achieved with little performance loss or average power increase.
Citations: 90
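The clock-by-clock voltage computation can be sketched with a simple series R-L model of the power delivery network: per-cycle current is summed from the active blocks and V = Vdd − I·R − L·dI/dt, and a di/dt controller throttles activity so the cycle-to-cycle current step stays bounded. All constants and per-block currents below are hypothetical; real power-delivery models are higher order:

```python
# Sketch of di/dt-induced supply voltage variation: per-cycle current is
# summed from active blocks, and a series R-L model of the power delivery
# network gives V = Vdd - I*R - L*dI/dt. A simple di/dt limiter throttles
# activity so the cycle-to-cycle current step stays bounded.

VDD, R, L = 1.5, 0.001, 1e-11        # hypothetical supply and parasitics
CYCLE = 1e-9                          # 1 GHz clock
BLOCK_CURRENT = {'fetch': 8.0, 'alu': 12.0, 'fpu': 20.0}  # amps when active

def voltage_trace(activity, di_limit=None):
    trace, prev_i = [], 0.0
    for active in activity:
        i = sum(BLOCK_CURRENT[b] for b in active)
        if di_limit is not None:      # di/dt control: clamp the current step
            i = max(prev_i - di_limit, min(prev_i + di_limit, i))
        v = VDD - i * R - L * (i - prev_i) / CYCLE
        trace.append(v)
        prev_i = i
    return trace

idle_then_burst = [()] * 3 + [('fetch', 'alu', 'fpu')] * 3
uncontrolled = voltage_trace(idle_then_burst)
controlled = voltage_trace(idle_then_burst, di_limit=10.0)
assert min(controlled) > min(uncontrolled)   # throttling shrinks the droop
```

The trade-off the abstract describes shows up directly: the limiter shrinks the worst-case droop at an idle-to-burst transition, but the core now takes a few extra cycles to reach full activity, which is the small performance cost of di/dt control.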