10th International Symposium on High Performance Computer Architecture (HPCA'04): Latest Publications

Perceptron-Based Branch Confidence Estimation
Haitham Akkary, Srikanth T. Srinivasan, Rajendar Koltur, Yogesh Patil, Wael Refaai
DOI: https://doi.org/10.1109/HPCA.2004.10002 · Published: 2004-02-14
Abstract: Pipeline gating has been proposed for reducing wasted speculative execution due to branch mispredictions. As processors become deeper or wider, pipeline gating becomes more important because the amount of wasted speculative execution increases. The quality of pipeline gating relies heavily on the branch confidence estimator used. Not much work has been done on branch confidence estimators since the initial work [6]. We show that the accuracy and coverage characteristics of the initial proposals do not sufficiently reduce mis-speculative execution on future deep-pipeline processors. In this paper, we present a new, perceptron-based branch confidence estimator that is twice as accurate as the current best-known method and achieves reasonable mispredicted-branch coverage. Further, the output of our predictor is multi-valued, which enables us to classify branches further as "strongly low confident" and "weakly low confident". We reverse the predictions of "strongly low confident" branches and apply pipeline gating to the "weakly low confident" branches. This combination of pipeline gating and branch reversal provides a spectrum of interesting design options, ranging from significantly reducing total execution for only a small performance loss, to lower but still significant reductions in total execution without any performance loss.
Citations: 26
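The abstract gives the estimator's interface (a multi-valued output split into confidence classes) but not its internals, so the following is a minimal sketch under assumptions: a perceptron over global branch-outcome history is trained to predict whether the branch predictor will be correct, and its output y is thresholded into "high", "weakly low", and "strongly low" confidence. The thresholds and history length are hypothetical values, not the paper's.

```python
THETA_HIGH = 14     # assumed: y above this -> high confidence
THETA_REVERSE = 14  # assumed: y below -THETA_REVERSE -> strongly low confident

class PerceptronConfidence:
    def __init__(self, history_len=16):
        self.history = [1] * history_len          # +1 taken / -1 not taken
        self.weights = [0] * (history_len + 1)    # weights[0] is the bias

    def output(self):
        return self.weights[0] + sum(
            w * h for w, h in zip(self.weights[1:], self.history))

    def classify(self):
        y = self.output()
        if y >= THETA_HIGH:
            return "high"           # trust the branch prediction
        if y <= -THETA_REVERSE:
            return "strongly-low"   # likely mispredicted: reverse the prediction
        return "weakly-low"         # unsure: apply pipeline gating

    def train(self, predictor_was_correct):
        # Standard perceptron update toward t = +1 (correct) / -1 (mispredicted),
        # applied when the output disagrees or is not yet past the threshold.
        t = 1 if predictor_was_correct else -1
        y = self.output()
        if (y >= 0) != (t > 0) or abs(y) <= THETA_HIGH:
            self.weights[0] += t
            for i, h in enumerate(self.history):
                self.weights[i + 1] += t * h

    def update_history(self, taken):
        self.history = self.history[1:] + [1 if taken else -1]
```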
Organizing the last line of defense before hitting the memory wall for CMPs
Chun Liu, A. Sivasubramaniam, M. Kandemir
DOI: https://doi.org/10.1109/HPCA.2004.10017 · Published: 2004-02-14
Abstract: The last line of defense in the cache hierarchy before going to off-chip memory is very critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) towards reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands across the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, SPECjbb, on a complete system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as a 42.50% improvement in IPC over the private organization (11.52% on average), and as much as a 42.22% improvement over the shared interleaved organization (9.76% on average).
Citations: 183
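The abstract describes the organization's goal but not its mechanics, so here is a hedged sketch of one plausible reading: the banks of a single shared L2 are split into per-CPU groups, giving private-style isolation, while the group assignment is a table that can be rebalanced when demand is imbalanced. Bank counts, the grouping table, and the rebalance trigger are all assumptions.

```python
NUM_CPUS, NUM_BANKS = 4, 8

# cpu -> list of bank ids currently assigned to it (initially an even split)
bank_groups = {cpu: [2 * cpu, 2 * cpu + 1] for cpu in range(NUM_CPUS)}

def lookup(cpu, tag, banks):
    """Probe only the requesting CPU's banks, as a private L2 would.
    banks: list of per-bank tag sets (a stand-in for the real arrays)."""
    for b in bank_groups[cpu]:
        if tag in banks[b]:
            return b
    return None                 # L2 miss -> go off-chip

def rebalance(donor, recipient):
    """Reassign one bank from a low-demand CPU to a high-demand CPU; this
    run-time flexibility is what a truly private L2 cannot offer."""
    if len(bank_groups[donor]) > 1:
        bank_groups[recipient].append(bank_groups[donor].pop())
```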
Improving Disk Throughput in Data-Intensive Servers
E. V. Carrera, R. Bianchini
DOI: https://doi.org/10.1109/HPCA.2004.10023 · Published: 2004-02-14
Abstract: Low disk throughput is one of the main impediments to improving the performance of data-intensive servers. In this paper, we propose two management techniques for the disk controller cache that can significantly increase disk throughput. The first technique, called File-Oriented Read-ahead (FOR), adjusts the number of read-ahead blocks brought into the disk controller cache according to file-system information. The second technique, called Host-guided Device Caching (HDC), gives the host control over part of the disk controller cache. As an example use of this mechanism, we keep the blocks that cause the most misses in the host buffer cache permanently cached in the disk controller. Our detailed simulations of real server workloads show that FOR and HDC can increase disk throughput by up to 34% and 24%, respectively, in comparison to conventional disk controller cache management techniques. When combined, the techniques can increase throughput by up to 47%.
Citations: 22
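Both mechanisms lend themselves to short sketches. Below, file_oriented_readahead clamps the prefetch depth at the file boundary, one way of "adjusting read-ahead according to file-system information", and HostGuidedCache pins host-ranked hot blocks. DEFAULT_READAHEAD, the pinned capacity, and the host interface are assumptions, not the paper's design.

```python
DEFAULT_READAHEAD = 32   # hypothetical fixed depth of a conventional controller

def file_oriented_readahead(block, file_last_block):
    """FOR sketch: read ahead sequentially, but stop at the file boundary
    known from file-system metadata, since blocks past it belong to other
    files and would only pollute the controller cache."""
    depth = max(0, min(DEFAULT_READAHEAD, file_last_block - block))
    return [block + i for i in range(1, depth + 1)]

class HostGuidedCache:
    """HDC sketch: the host pins its most miss-prone blocks in a
    host-controlled region of the disk controller cache."""
    def __init__(self, pinned_capacity):
        self.pinned_capacity = pinned_capacity
        self.pinned = set()

    def pin(self, hot_blocks):
        # hot_blocks: blocks causing the most host buffer-cache misses,
        # ranked by the host; keep the top ones resident in the controller.
        self.pinned = set(hot_blocks[: self.pinned_capacity])
```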
A low-complexity, high-performance fetch unit for simultaneous multithreading processors
Ayose Falcón, Alex Ramírez, M. Valero
DOI: https://doi.org/10.1109/HPCA.2004.10003 · Published: 2004-02-14
Abstract: Simultaneous multithreading (SMT) is an architectural technique that allows several threads to execute in parallel. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies for deciding fetch priority among threads, trying to increase fetch performance. We demonstrate that sharing the fetch unit among threads within a cycle not only increases the complexity of the fetch unit but can also be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal feeds an 8-way processor by fetching from a single thread each cycle, reducing complexity and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our design obtains better average performance for any kind of workload (both ILP and memory-bound benchmarks), in contrast to previously proposed solutions.
Citations: 15
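As a concrete illustration of fetching from a single thread each cycle, here is a minimal selection sketch. The ICOUNT-style heuristic (favor the thread with the fewest in-flight instructions) is a stand-in borrowed from the SMT fetch-policy literature, not necessarily the policy the authors pair with the FTB or stream fetch unit.

```python
def select_fetch_thread(inflight_counts, icache_stalled):
    """Pick the single thread to feed the wide fetch unit this cycle.

    inflight_counts: per-thread count of instructions in the pipeline.
    icache_stalled:  per-thread flag for threads blocked on an I-cache miss.
    """
    candidates = [t for t in range(len(inflight_counts)) if not icache_stalled[t]]
    if not candidates:
        return None                       # nothing fetchable this cycle
    # ICOUNT-style: the least-represented thread gets the whole fetch width.
    return min(candidates, key=lambda t: inflight_counts[t])

# Example: thread 1 has the fewest in-flight instructions, so it is chosen.
assert select_fetch_thread([12, 3, 7, 9], [False] * 4) == 1
```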
Hardware Support for Prescient Instruction Prefetch
Tor M. Aamodt, P. Chow, Per Hammarlund, Hong Wang, John Paul Shen
DOI: https://doi.org/10.1109/HPCA.2004.10028 · Published: 2004-02-14
Abstract: This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch, an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced that help satisfy these requirements. This paper also proposes and evaluates finite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium® SMT processor with next-line and streaming I-prefetch mechanisms that incurs latencies representative of next-generation processors, prescient instruction prefetch can improve performance by an average of 10.0% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive against even the most aggressive research hardware instruction prefetch technique: fetch-directed instruction prefetch.
Citations: 28
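Of the mechanisms named above, safe-store is the easiest to sketch: helper-thread stores go to a small private buffer instead of memory, so later helper-thread loads see them (store-to-load communication) without corrupting architectural state. The buffer organization below, an unbounded dict keyed by address, is a simplification; a real safe-store would be a small, tagged hardware structure.

```python
class SafeStoreBuffer:
    """Holds the speculative stores of one instruction-prefetch helper thread."""

    def __init__(self):
        self.buf = {}            # address -> speculative value

    def store(self, addr, value):
        self.buf[addr] = value   # never written back to real memory

    def load(self, addr, memory):
        # Helper-thread loads hit the safe-store buffer first, then fall
        # back to (read-only) architectural memory, modeled here as a dict.
        return self.buf.get(addr, memory.get(addr))

    def squash(self):
        self.buf.clear()         # helper thread done: discard speculative state
```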
Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management
Qingbo Zhu, Francis M. David, Christo Frank Devaraj, Zhenmin Li, Yuanyuan Zhou, P. Cao
DOI: https://doi.org/10.1109/HPCA.2004.10022 · Published: 2004-02-14
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady's off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.
Citations: 242
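The paper's off-line greedy and online algorithms are not spelled out in the abstract; the sketch below captures only the power-aware intuition such schemes build on: among eviction candidates, prefer a block whose home disk is already spinning, so that blocks from spun-down disks stay cached and those disks' idle periods stay long enough for spin-down to pay off. Tying the scan to LRU order is an assumption.

```python
from collections import namedtuple

Block = namedtuple("Block", ["block_id", "disk"])

def choose_victim(lru_ordered, disk_spun_down):
    """lru_ordered: cached blocks, least recently used first.
    disk_spun_down: per-disk flag, True if that disk is sleeping."""
    for blk in lru_ordered:
        if not disk_spun_down[blk.disk]:
            return blk          # a re-miss on this block costs no spin-up
    return lru_ordered[0]       # every disk asleep: fall back to plain LRU

# Example: the LRU block lives on sleeping disk 0, so the next-oldest
# block on active disk 1 is evicted instead.
blocks = [Block(1, 0), Block(2, 1), Block(3, 0)]
assert choose_victim(blocks, {0: True, 1: False}) == Block(2, 1)
```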
Data Cache Prefetching Using a Global History Buffer
Kyle J. Nesbit, James E. Smith
DOI: https://doi.org/10.1109/MM.2005.6 · Published: 2004-02-14
Abstract: A new structure for implementing data cache prefetching is proposed and analyzed via simulation. The structure is based on a Global History Buffer that holds the most recent miss addresses in FIFO order. Linked lists within this global history buffer connect addresses that have some common property, e.g., they were all generated by the same load instruction. The Global History Buffer can be used for implementing a number of previously proposed prefetch methods, as well as new ones. Prefetching with the Global History Buffer has two significant advantages over conventional table prefetching methods. First, the use of a FIFO history buffer can improve the accuracy of correlation prefetching by eliminating stale data from the table. Second, the Global History Buffer contains a more complete (and intact) picture of cache miss history, creating opportunities to design more effective prefetching methods. Global History Buffer prefetching can increase correlation prefetching performance by 20% and cut its memory traffic by 90%. Furthermore, the Global History Buffer can make correlations within a load's address stream, which can increase stride prefetching performance by 6%. Collectively, the Global History Buffer prefetching methods perform as well as or better than the conventional prefetching methods studied on 14 of 15 benchmarks.
Citations: 375
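The abstract describes the structure precisely enough to sketch: a FIFO of recent miss addresses plus an index table whose entries head linked lists threading through the FIFO (keyed here by load PC, one of the groupings mentioned). Sizes are placeholders.

```python
class GHB:
    def __init__(self, size=256):
        self.size = size
        self.entries = {}      # seq -> (miss_addr, seq of prev entry, same key)
        self.index = {}        # key (e.g. load PC) -> seq of newest entry
        self.next_seq = 0      # monotonically increasing FIFO position

    def insert(self, key, miss_addr):
        # Link the new entry to the previous miss with the same key.
        self.entries[self.next_seq] = (miss_addr, self.index.get(key))
        self.index[key] = self.next_seq
        self.next_seq += 1
        # FIFO eviction: dropping the oldest entry ages out stale history,
        # which is the accuracy advantage over a conventional table.
        self.entries.pop(self.next_seq - self.size - 1, None)

    def history_for(self, key, depth=3):
        """Walk the linked list to collect this key's recent miss addresses."""
        out, seq = [], self.index.get(key)
        while seq in self.entries and len(out) < depth:
            addr, seq = self.entries[seq]
            out.append(addr)
        return out

# Example use: detect a constant stride in one load's miss stream.
ghb = GHB()
for a in (0x100, 0x140, 0x180):        # three misses by the load at PC 0x400ABC
    ghb.insert(key=0x400ABC, miss_addr=a)
h = ghb.history_for(0x400ABC)          # [0x180, 0x140, 0x100], newest first
if h[0] - h[1] == h[1] - h[2]:
    prefetch_addr = h[0] + (h[0] - h[1])   # stride detected -> prefetch 0x1C0
```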
Reducing branch misprediction penalty via selective branch recovery
A. Gandhi, Haitham Akkary, Srikanth T. Srinivasan
DOI: https://doi.org/10.1109/HPCA.2004.10004 · Published: 2004-02-14
Abstract: Branch misprediction penalty consists of two components: the time wasted on misspeculative execution until the mispredicted branch is resolved, and the time to restart the pipeline with useful instructions once the branch is resolved. Current processor trends, large instruction windows and deep pipelines, amplify both components of the branch misprediction penalty. We propose a novel method, called selective branch recovery (SBR), to reduce both components of branch misprediction penalty. SBR exploits a frequently occurring type of control independence, exact convergence, where the mispredicted path converges exactly at the beginning of the correct path. In such cases, SBR selectively reuses the results computed during misspeculative execution and obviates the need to fetch or rename convergent instructions again. Thus, SBR addresses both components of branch misprediction penalty. To increase the likelihood of branch mispredictions that can be handled with SBR, we also present an effective means for inducing exact convergence on misspeculative paths. With SBR, we significantly improve performance (between 3% and 22%, 8% on average) on a wide range of benchmarks over a baseline processor that does not exploit SBR.
Citations: 41
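A hedged sketch of the exact-convergence test implied above: when a branch resolves as mispredicted, check whether the wrong path reached the correct-path start PC; if so, only the pre-convergence slice needs squashing, and the convergent instructions become reuse candidates. How SBR verifies data dependences before actually reusing results is not captured here.

```python
def recovery_action(correct_target_pc, wrongpath_pcs):
    """Decide how to recover from a resolved misprediction.

    correct_target_pc: first PC of the correct path.
    wrongpath_pcs:     PCs fetched, in order, down the mispredicted path.
    """
    if correct_target_pc in wrongpath_pcs:
        conv = wrongpath_pcs.index(correct_target_pc)
        # Exact convergence: instructions from index `conv` onward are
        # control independent; squash only the slice before them and
        # consider their results for reuse (no re-fetch, no re-rename).
        return ("reuse_from", conv)
    return ("full_squash", None)       # conventional recovery

# Example: the wrong path fell through into the correct target at 0x40C.
assert recovery_action(0x40C, [0x404, 0x408, 0x40C, 0x410]) == ("reuse_from", 2)
```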
Exploiting prediction to reduce power on buses
V. Wen, M. Whitney, Yatish Patel, J. Kubiatowicz
DOI: https://doi.org/10.1109/HPCA.2004.10025 · Published: 2004-02-14
Abstract: We investigate coding techniques to reduce the energy consumed by on-chip buses in a microprocessor. We explore several simple coding schemes and simulate them using a modified SimpleScalar simulator and SPEC benchmarks. We show an average of 35% savings in transitions on internal buses. To quantify actual power savings, we design a dictionary-based encoder/decoder circuit in a 0.13 µm process, extract it as a netlist, and simulate its behavior under SPICE. Utilizing a realistic wire model with repeaters, we show that we can break even at median wire-length scales of less than 11.5 mm at 0.13 µm, and project a break-even point of 2.7 mm for a larger design at 0.07 µm.
Citations: 13
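A minimal functional model of a dictionary-based bus coder of the kind evaluated above: encoder and decoder keep identical small dictionaries, and a dictionary hit sends only a narrow index, so fewer bus lines toggle. The dictionary size and FIFO replacement are assumptions; transitions() is a proxy for the toggle count behind the 35% figure.

```python
def transitions(prev_word, word):
    """Bit toggles between consecutive bus states (a dynamic-power proxy)."""
    return bin(prev_word ^ word).count("1")

class DictionaryCoder:
    """Encoder and decoder run identical copies of this class, so their
    dictionaries stay in sync after every transfer."""

    def __init__(self, entries=8):
        self.entries = entries
        self.values = []                 # small FIFO dictionary

    def _update(self, word):
        if word not in self.values:
            self.values.append(word)
            if len(self.values) > self.entries:
                self.values.pop(0)       # assumed FIFO replacement

    def encode(self, word):
        if word in self.values:
            code = ("hit", self.values.index(word))  # narrow index: few toggles
        else:
            code = ("miss", word)                    # full word on the bus
        self._update(word)
        return code

    def decode(self, code):
        kind, payload = code
        word = self.values[payload] if kind == "hit" else payload
        self._update(word)
        return word

# Round trip: first transfer misses, the repeat hits and sends only an index.
enc, dec = DictionaryCoder(), DictionaryCoder()
assert dec.decode(enc.encode(0xDEAD)) == 0xDEAD
assert enc.encode(0xDEAD) == ("hit", 0)
```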
Out-of-order commit processors
A. Cristal, Daniel Ortega, J. Llosa, M. Valero
DOI: https://doi.org/10.1109/HPCA.2004.10008 · Published: 2004-02-14
Abstract: Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications, where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue, and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms, our processor suffers a performance degradation of only 10% on SPEC2000fp relative to a conventional processor that requires more than an order of magnitude additional entries in the ROB and instruction queues, and achieves about a 200% improvement over a current processor with a similar number of entries.
Citations: 148
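A hedged sketch of checkpoint-based recovery in the spirit of the mechanism described above: instead of holding every in-flight instruction in a huge ROB until commit, take a register-map checkpoint at selected points (e.g., low-confidence branches) and, on a misprediction, restore the nearest earlier checkpoint so only younger instructions replay. The checkpoint placement policy and the map representation are assumptions.

```python
class CheckpointRecovery:
    def __init__(self):
        self.checkpoints = []    # (seq_no, snapshot of the register map)
        self.reg_map = {}        # architectural reg -> physical reg

    def maybe_checkpoint(self, seq_no, low_confidence_branch):
        # Checkpoint only where recovery is likely, keeping the number of
        # live checkpoints (and their cost) roughly constant regardless of
        # how many instructions are in flight.
        if low_confidence_branch:
            self.checkpoints.append((seq_no, dict(self.reg_map)))

    def recover(self, bad_branch_seq):
        # Discard checkpoints younger than the mispredicted branch, then
        # restore the newest survivor; only instructions after it replay,
        # and everything older can already have released its resources.
        while self.checkpoints and self.checkpoints[-1][0] > bad_branch_seq:
            self.checkpoints.pop()
        if self.checkpoints:
            self.reg_map = dict(self.checkpoints[-1][1])
```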