Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture: Latest Publications

Register renaming and scheduling for dynamic execution of predicated code
P. Wang, Hong Wang, R. Kling, K. Ramakrishnan, John Paul Shen
DOI: 10.1109/HPCA.2001.903248
Abstract: Achieving higher processor performance requires greater synergy between advanced hardware features and innovative compiler techniques. Recent advances in compilation techniques for predicated execution have provided significant opportunities for exploiting instruction-level parallelism. However, little research has been done on how to efficiently execute predicated code in a dynamic microarchitecture. In this paper, we evaluate hardware optimizations for executing predicated code on a dynamically scheduled microarchitecture. We provide two novel ideas to improve the efficiency of executing predicated code. On a generic Intel Itanium processor pipeline model, we demonstrate that, with some microarchitecture enhancements, a dynamic execution processor can achieve about 16% performance improvement over an equivalent static execution processor.
Published: 2001-01-20
Citations: 45
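The abstract does not spell out the paper's two mechanisms, but one common way a dynamic pipeline can rename a predicated definition is to treat "(p) r1 = ..." as a select between the new value and r1's previous physical mapping, so consumers always see a single unambiguous source. A minimal sketch of that generic idea (all names and the rename-table layout are illustrative, not the paper's design):

```python
def rename_predicated_def(dest, pred, rename_map, free_list):
    """Rename the predicated definition '(pred) dest = ...'.

    Allocates a fresh physical register and remembers the old mapping,
    so the op can later execute as a select:
        new_phys = computed value if pred else old_phys
    """
    old_phys = rename_map.get(dest)        # prior mapping, needed if pred is false
    new_phys = free_list.pop(0)            # allocate a fresh physical register
    rename_map[dest] = new_phys            # consumers now read the new mapping
    return {"dest_phys": new_phys, "old_phys": old_phys, "pred": pred}
```

The select-style rewrite is what lets an out-of-order core treat a predicated instruction as an ordinary data dependence rather than a control hazard.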
Data-flow prescheduling for large instruction windows in out-of-order processors
P. Michaud, André Seznec
DOI: 10.1109/HPCA.2001.903249
Abstract: The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small.
Published: 2001-01-20
Citations: 115
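The core idea above, reordering instructions by predicted readiness before they reach the issue buffer, can be sketched as computing each instruction's data-flow depth (the earliest cycle its operands can be ready) and sorting by it. A simplified illustration with a uniform unit latency, not the paper's implementation:

```python
def preschedule(instructions, latency=1):
    """instructions: list of (name, [source names]) in program order.
    Returns the list reordered by predicted data-flow depth, so the issue
    buffer mostly holds instructions that are close to ready."""
    depth = {}
    annotated = []
    for name, sources in instructions:
        # An instruction can start once its last producer has finished.
        start = max((depth[s] + latency for s in sources if s in depth), default=0)
        depth[name] = start
        annotated.append((start, name, sources))
    # Stable sort preserves program order within the same depth group.
    annotated.sort(key=lambda t: t[0])
    return [(name, sources) for _, name, sources in annotated]
```

Two independent dependence chains get interleaved by depth, which is exactly what lets a small issue buffer act like a much larger instruction window.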
JETTY: filtering snoops for reduced energy consumption in SMP servers
Andreas Moshovos, G. Memik, B. Falsafi, A. Choudhary
DOI: 10.1109/HPCA.2001.903254
Abstract: We propose methods for reducing the energy consumed by snoop requests in snoopy bus-based symmetric multiprocessor (SMP) systems. Observing that a large fraction of snoops do not find copies in many of the other caches, we introduce JETTY, a small, cache-like structure. A JETTY is placed between the bus and the L2 backside of each processor, where it filters the vast majority of snoops that would not find a locally cached copy. Energy is reduced because accesses to the much more energy-demanding L2 tag arrays are decreased. No changes to the existing coherence protocol are required and no performance loss is experienced. We evaluate our method on a 4-way SMP server using a set of shared-memory applications. We demonstrate that a very small JETTY filters 74% (average) of all snoop-induced tag accesses that would miss, resulting in an average energy reduction of 29% (range: 12% to 40%), measured as a fraction of the energy required by all L2 accesses (both tag and data arrays).
Published: 2001-01-20
Citations: 190
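A counting-vector organization is one way to build such a filter: track, per hashed block address, how many locally cached blocks map to that slot, and filter any snoop whose slot count is zero. A minimal sketch of that idea (the size and hash are illustrative; the paper evaluates several JETTY organizations):

```python
class Jetty:
    """Toy snoop filter sitting between the bus and a processor's L2."""

    def __init__(self, size=256):
        self.size = size
        self.counts = [0] * size  # cached blocks mapping to each slot

    def _slot(self, block_addr):
        return block_addr % self.size

    def on_fill(self, block_addr):
        # A block was allocated into the local cache hierarchy.
        self.counts[self._slot(block_addr)] += 1

    def on_evict(self, block_addr):
        # A block left the local cache hierarchy.
        self.counts[self._slot(block_addr)] -= 1

    def may_be_cached(self, block_addr):
        # False means the snoop can be filtered: the block is provably
        # absent, so the energy-hungry L2 tag probe is skipped.
        return self.counts[self._slot(block_addr)] > 0
```

Note the safety property that makes this compatible with an unchanged coherence protocol: the filter can answer "maybe present" for an absent block (costing only a wasted tag probe), but never "absent" for a present one.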
Dynamic prediction of critical path instructions
Eric Tune, Dongning Liang, D. Tullsen, B. Calder
DOI: 10.1109/HPCA.2001.903262
Abstract: Modern processors come close to executing as fast as true dependences allow. The particular dependences that constrain execution speed constitute the critical path of execution. To optimize the performance of the processor, we either have to reduce the critical path or execute it more efficiently. In both cases, this can be done more effectively if we know the actual instructions that constitute that path. This paper describes critical path prediction for dynamically identifying instructions likely to be on the critical path, allowing various processor optimizations to take advantage of this information. We show several possible critical path prediction techniques and apply critical path prediction to value prediction and clustered architecture scheduling. We show that critical path prediction has the potential to increase the effectiveness of these hardware optimizations by as much as 70%, without adding greatly to their cost.
Published: 2001-01-20
Citations: 147
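A PC-indexed table of saturating counters, trained by some heuristic that flags dynamic instructions as likely critical (for example, being the oldest unissued instruction), is one simple shape such a predictor can take. A sketch of that generic structure; the table size, threshold, and training heuristic are illustrative assumptions, not the paper's specific designs:

```python
class CriticalPathPredictor:
    """Toy PC-indexed critical-path predictor with saturating counters."""

    def __init__(self, size=4096, max_count=7, threshold=4):
        self.size = size
        self.max_count = max_count
        self.threshold = threshold
        self.table = [0] * size

    def _idx(self, pc):
        return pc % self.size

    def train(self, pc, flagged_critical):
        # Increment when the heuristic flags this dynamic instance as
        # critical; otherwise decay, so stale criticality fades out.
        i = self._idx(pc)
        if flagged_critical:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        elif self.table[i] > 0:
            self.table[i] -= 1

    def is_critical(self, pc):
        # Consumers (e.g. a value predictor or cluster steering logic)
        # spend their limited resources only on predicted-critical PCs.
        return self.table[self._idx(pc)] >= self.threshold
```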
A delay model and speculative architecture for pipelined routers
L. Peh, W. Dally
DOI: 10.1109/HPCA.2001.903268
Abstract: This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method employed, the delay of the flow control credit path, and the sharing of crossbar ports across virtual channels. Motivated by this model, we introduce a microarchitecture for a speculative virtual-channel router that significantly reduces its router latency to that of a wormhole router. Simulations using our pipelined model give results that differ considerably from the commonly assumed 'unit-latency' model, which is unreasonably optimistic. Using realistic pipeline models, we compare wormhole and virtual-channel flow control. Our results show that a speculative virtual-channel router has the same per-hop router latency as a wormhole router while improving throughput by up to 40%.
Published: 2001-01-20
Citations: 575
Differential FCM: increasing value prediction accuracy by improving table usage efficiency
B. Goeman, H. Vandierendonck, K. D. Bosschere
DOI: 10.1109/HPCA.2001.903264
Abstract: Value prediction is a relatively new technique to increase the instruction-level parallelism (ILP) in future microprocessors. An important problem when designing a value predictor is efficiency: an accurate predictor requires huge prediction tables. This is especially the case for the finite context method (FCM) predictor, the most accurate one. In this paper, we show that the prediction accuracy of the FCM can be greatly improved by making the FCM predict strides instead of values. This new predictor is called the differential finite context method (DFCM) predictor. The DFCM predictor outperforms a similar FCM predictor by as much as 33%, depending on the prediction table size. If we take the additional storage into account, the difference is still 15% for realistic predictor sizes. We use several metrics to show that the key to this success is reduced aliasing in the level-2 table. We also show that the DFCM is superior to hybrid predictors based on FCM and stride predictors, since its prediction accuracy is higher than that of a hybrid one using a perfect meta-predictor.
Published: 2001-01-20
Citations: 142
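The abstract's central idea, an FCM that predicts strides rather than values, can be sketched directly: a level-1 table keeps each instruction's last value and recent stride history, a level-2 table maps a hash of that history to the next stride, and the prediction is last value plus predicted stride. Table sizes, the hash, and the use of dictionaries are illustrative simplifications:

```python
class DFCMPredictor:
    """Toy differential finite context method (DFCM) value predictor."""

    def __init__(self, history_len=4, l2_size=1024):
        self.history_len = history_len
        self.l2_size = l2_size
        self.l1 = {}   # pc -> (last_value, tuple of recent strides)
        self.l2 = {}   # hashed stride history -> predicted next stride

    def _index(self, history):
        return hash(history) % self.l2_size

    def predict(self, pc):
        if pc not in self.l1:
            return None
        last_value, history = self.l1[pc]
        stride = self.l2.get(self._index(history), 0)
        return last_value + stride

    def update(self, pc, value):
        # On commit of the actual value: train the level-2 table with the
        # observed stride, then shift it into the level-1 history.
        if pc in self.l1:
            last_value, history = self.l1[pc]
            stride = value - last_value
            self.l2[self._index(history)] = stride
            history = (history + (stride,))[-self.history_len:]
        else:
            history = ()
        self.l1[pc] = (value, history)
```

Because many different value sequences share the same stride pattern, level-2 entries are reused across them, which is the reduced-aliasing effect the abstract credits for the accuracy gain.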
Stack value file: custom microarchitecture for the stack
H. Lee, M. Smelyanskiy, C. Newburn, G. Tyson
DOI: 10.1109/HPCA.2001.903247
Abstract: As processor performance increases, there is a corresponding increase in the demands on the memory system, including caches. Research papers have proposed partitioning the cache into instruction/data, temporal/non-temporal, and/or stack/non-stack regions. Each of these designs can improve performance by constructing two separate structures which can be probed in parallel while reducing contention. In this paper, we propose a new memory organization that partitions data references into stack and non-stack regions. Non-stack references are routed to a conventional cache. Stack references, on the other hand, are shown to have several characteristics that can be leveraged to improve performance using a less conventional storage organization. This paper enumerates those characteristics and proposes a new microarchitectural feature, the stack value file (SVF), which exploits them to improve instruction-level parallelism, reduce stack access latencies, reduce demand on the first-level cache, and reduce data bus traffic. Our results show that the SVF can improve execution performance by 29 to 65% while reducing overhead traffic for the stack region by many orders of magnitude over cache structures of the same size.
Published: 2001-01-20
Citations: 59
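The routing idea in the abstract, sending stack-region references to a small register-file-like structure and everything else to the conventional cache, can be sketched as an address-range check. The region bounds, indexing by offset from the stack base, and the dictionary-backed stores are all illustrative stand-ins, not the paper's microarchitecture:

```python
class StackValueFile:
    """Toy model of SVF-style routing of data references by address."""

    def __init__(self, stack_base, size_bytes=2048):
        self.base = stack_base       # stack grows down from this address
        self.size = size_bytes
        self.entries = {}            # offset from base -> value (the SVF)
        self.cache = {}              # stand-in for the conventional L1 data cache

    def _in_stack(self, addr):
        return self.base - self.size <= addr < self.base

    def store(self, addr, value):
        if self._in_stack(addr):
            self.entries[self.base - addr] = value   # low-latency SVF path
        else:
            self.cache[addr] = value                 # normal cache path

    def load(self, addr):
        if self._in_stack(addr):
            return self.entries.get(self.base - addr, 0)
        return self.cache.get(addr, 0)
```

Indexing by offset from the stack base mirrors why stack references suit a register-file-like store: they are small, dense, and computed relative to one pointer.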
Quantifying the impact of architectural scaling on communication
Taliver Heath, Samian Kaur, R. Martin, Thu D. Nguyen
DOI: 10.1109/HPCA.2001.903269
Abstract: This work quantifies how persistent increases in processor speed relative to I/O speed reduce the performance gap between specialized, high-performance messaging layers and general-purpose protocols such as TCP/IP and UDP/IP. The comparison is important because specialized layers sacrifice considerable system connectivity and robustness to obtain increased performance. We first quantify the scaling effects on small messages by measuring the LogP performance of two Active Message II layers, one running over a specialized VIA layer and the other over stock UDP, as we scale the CPU and I/O components. We then predict future LogP performance by mapping the LogP model's network parameters, particularly overhead, onto architectural components. Our projections show that the performance benefit afforded by specialized messaging for small messages will erode to a factor of 2 in the next 5 years. Our models further show that the performance differential between the two approaches will continue to erode without a radical restructuring of the I/O system. For long messages, we quantify the variable per-page instruction budget that a zero-copy messaging approach has for page table manipulations if it is to outperform a single-copy approach. Finally, we conclude with an examination of future I/O advances that would result in substantial improvements to messaging performance.
Published: 2001-01-20
Citations: 4
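For readers unfamiliar with the LogP model used in the study: a small message costs send overhead plus network latency plus receive overhead, and a sender can inject at most one message per gap g. A minimal sketch of those two standard cost formulas; the parameter values in the test are illustrative, not measurements from the paper:

```python
def logp_small_message_time(L, o_send, o_recv):
    """One-way time for a single small message under LogP:
    send overhead, then network latency L, then receive overhead."""
    return o_send + L + o_recv

def logp_n_message_time(n, L, o, g):
    """Time until the last of n back-to-back small messages is received,
    assuming symmetric per-message overhead o and injection gap g."""
    # The sender can start a new message every max(o, g) cycles; the last
    # one then pays the full o + L + o pipeline.
    return (n - 1) * max(o, g) + o + L + o
```

Mapping the overhead term o onto architectural components (CPU cycles spent in the protocol stack) is what lets the paper project how CPU scaling erodes the gap between specialized and general-purpose layers.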
Call graph prefetching for database applications
M. Annavaram, J. Patel, E. Davidson
DOI: 10.1109/HPCA.2001.903270
Abstract: With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper we propose Call Graph Prefetching (CGP), a hardware technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of I-cache misses was very small even without any prefetching, obviating the need for CGP. Our database experiments show that CGP reduces I-cache misses by 83% and can improve the performance of a database system by 30% over a baseline system that uses the OM tool to lay out the code so as to improve I-cache performance. CGP also achieved 7% higher performance than OM with next-N-line prefetching on database applications.
Published: 2001-01-20
Citations: 58
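The mechanism described above can be sketched as a call graph history table: record which callee followed each function last time, and on re-entering a function, prefetch the predicted callee's instructions. The table keying on function identifiers and the single-successor training rule are illustrative simplifications of the hardware:

```python
class CallGraphPrefetcher:
    """Toy call graph prefetching (CGP) model."""

    def __init__(self):
        self.next_callee = {}   # caller -> most recently observed callee
        self.prefetched = []    # trace of issued prefetch targets

    def on_function_entry(self, func):
        # On entering a function, prefetch the callee the table predicts
        # will be invoked next, hiding its I-cache miss latency.
        target = self.next_callee.get(func)
        if target is not None:
            self.prefetched.append(target)

    def on_call(self, caller, callee):
        # Train: remember the observed call sequence for next time.
        self.next_callee[caller] = callee
```

This works precisely because, as the abstract notes, database call sequences are highly repetitive: the same query operators invoke the same helpers in the same order across millions of tuples.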
DLP+TLP processors for the next generation of media workloads
J. Corbal, R. Espasa, M. Valero
DOI: 10.1109/HPCA.2001.903265
Abstract: Future media workloads will require about two orders of magnitude more performance than current general-purpose processors achieve. High uni-threaded performance will be needed to meet real-time constraints, together with huge computational throughput, as the next generation of media workloads will be eminently multithreaded (MPEG-4/MPEG-7). In order to meet the challenge of providing both good uni-threaded performance and throughput, we propose to join the simultaneous multithreading (SMT) execution paradigm with the ability to execute media-oriented streaming μ-SIMD instructions. This paper evaluates the performance of two different aggressive SMT processors: one with conventional μ-SIMD extensions (such as MMX) and one with longer streaming vector μ-SIMD extensions. We show that future media workloads are, in fact, dominated by scalar performance. The combination of SMT plus streaming vector μ-SIMD helps alleviate the performance bottleneck of the integer unit. SMT allows "hiding" vector execution underneath integer execution by overlapping the two types of computation, while the streaming vector μ-SIMD extensions reduce the pressure on issue width and fetch bandwidth, and provide a powerful mechanism to tolerate latency that makes it possible to implement smart decoupled cache hierarchies.
Published: 2001-01-20
Citations: 13