Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture: Latest Publications

Register renaming and scheduling for dynamic execution of predicated code
P. Wang, Hong Wang, R. Kling, K. Ramakrishnan, John Paul Shen
DOI: 10.1109/HPCA.2001.903248
Abstract: Achieving higher processor performance requires greater synergy between advanced hardware features and innovative compiler techniques. Recent advances in compilation techniques for predicated execution have provided significant opportunities for exploiting instruction-level parallelism. However, little research has been done on how to efficiently execute predicated code in a dynamic microarchitecture. In this paper, we evaluate hardware optimizations for executing predicated code on a dynamically scheduled microarchitecture. We provide two novel ideas to improve the efficiency of executing predicated code. On a generic Intel Itanium processor pipeline model, we demonstrate that, with some microarchitecture enhancements, a dynamic execution processor can achieve about 16% performance improvement over an equivalent static execution processor.
Published: 2001-01-20
Citations: 45
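The abstract does not spell out the paper's two mechanisms, but one common way a dynamic pipeline can rename a predicated definition is to treat "(p) r1 = ..." as a select between the new value and r1's previous physical mapping, so consumers always see a single unambiguous source. A minimal sketch of that generic idea (all names and the rename-table layout are illustrative, not the paper's design):

```python
def rename_predicated_def(dest, pred, rename_map, free_list):
    """Rename the predicated definition '(pred) dest = ...'.

    Allocates a fresh physical register and remembers the old mapping,
    so the op can later execute as a select:
        new_phys = computed value if pred else old_phys
    """
    old_phys = rename_map.get(dest)        # prior mapping, needed if pred is false
    new_phys = free_list.pop(0)            # allocate a fresh physical register
    rename_map[dest] = new_phys            # consumers now read the new mapping
    return {"dest_phys": new_phys, "old_phys": old_phys, "pred": pred}
```

The select-style rewrite is what lets an out-of-order core treat a predicated instruction as an ordinary data dependence rather than a control hazard.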
Data-flow prescheduling for large instruction windows in out-of-order processors
P. Michaud, André Seznec
DOI: 10.1109/HPCA.2001.903249
Abstract: The performance of out-of-order processors increases with the instruction window size. In conventional processors, the effective instruction window cannot be larger than the issue buffer. Determining which instructions from the issue buffer can be launched to the execution units is a time-critical operation whose complexity increases with the issue buffer size. We propose to relieve the issue stage by reordering instructions before they enter the issue buffer. This study introduces the general principle of data-flow prescheduling. Then we describe a possible implementation. Our preliminary results show that data-flow prescheduling makes it possible to enlarge the effective instruction window while keeping the issue buffer small.
Published: 2001-01-20
Citations: 115
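The core idea above, reordering instructions by predicted readiness before they reach the issue buffer, can be sketched as computing each instruction's data-flow depth (the earliest cycle its operands can be ready) and sorting by it. A simplified illustration with a uniform unit latency, not the paper's implementation:

```python
def preschedule(instructions, latency=1):
    """instructions: list of (name, [source names]) in program order.
    Returns the list reordered by predicted data-flow depth, so the issue
    buffer mostly holds instructions that are close to ready."""
    depth = {}
    annotated = []
    for name, sources in instructions:
        # An instruction can start once its last producer has finished.
        start = max((depth[s] + latency for s in sources if s in depth), default=0)
        depth[name] = start
        annotated.append((start, name, sources))
    # Stable sort preserves program order within the same depth group.
    annotated.sort(key=lambda t: t[0])
    return [(name, sources) for _, name, sources in annotated]
```

Two independent dependence chains get interleaved by depth, which is exactly what lets a small issue buffer act like a much larger instruction window.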
JETTY: filtering snoops for reduced energy consumption in SMP servers
Andreas Moshovos, G. Memik, B. Falsafi, A. Choudhary
DOI: 10.1109/HPCA.2001.903254
Abstract: We propose methods for reducing the energy consumed by snoop requests in snoopy bus-based symmetric multiprocessor (SMP) systems. Observing that a large fraction of snoops do not find copies in many of the other caches, we introduce JETTY, a small, cache-like structure. A JETTY is placed between the bus and the L2 backside of each processor, where it filters the vast majority of snoops that would not find a locally cached copy. Energy is reduced because accesses to the much more energy-demanding L2 tag arrays are decreased. No changes to the existing coherence protocol are required and no performance loss is experienced. We evaluate our method on a 4-way SMP server using a set of shared-memory applications. We demonstrate that a very small JETTY filters 74% (average) of all snoop-induced tag accesses that would miss, resulting in an average energy reduction of 29% (range: 12% to 40%), measured as a fraction of the energy required by all L2 accesses (both tag and data arrays).
Published: 2001-01-20
Citations: 190
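A counting-vector organization is one way to build such a filter: track, per hashed block address, how many locally cached blocks map to that slot, and filter any snoop whose slot count is zero. A minimal sketch of that idea (the size and hash are illustrative; the paper evaluates several JETTY organizations):

```python
class Jetty:
    """Toy snoop filter sitting between the bus and a processor's L2."""

    def __init__(self, size=256):
        self.size = size
        self.counts = [0] * size  # cached blocks mapping to each slot

    def _slot(self, block_addr):
        return block_addr % self.size

    def on_fill(self, block_addr):
        # A block was allocated into the local cache hierarchy.
        self.counts[self._slot(block_addr)] += 1

    def on_evict(self, block_addr):
        # A block left the local cache hierarchy.
        self.counts[self._slot(block_addr)] -= 1

    def may_be_cached(self, block_addr):
        # False means the snoop can be filtered: the block is provably
        # absent, so the energy-hungry L2 tag probe is skipped.
        return self.counts[self._slot(block_addr)] > 0
```

Note the safety property that makes this compatible with an unchanged coherence protocol: the filter can answer "maybe present" for an absent block (costing only a wasted tag probe), but never "absent" for a present one.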
Dynamic prediction of critical path instructions
Eric Tune, Dongning Liang, D. Tullsen, B. Calder
DOI: 10.1109/HPCA.2001.903262
Abstract: Modern processors come close to executing as fast as true dependences allow. The particular dependences that constrain execution speed constitute the critical path of execution. To optimize the performance of the processor, we either have to reduce the critical path or execute it more efficiently. In both cases, this can be done more effectively if we know the actual instructions that constitute that path. This paper describes critical path prediction for dynamically identifying instructions likely to be on the critical path, allowing various processor optimizations to take advantage of this information. We show several possible critical path prediction techniques and apply critical path prediction to value prediction and clustered architecture scheduling. We show that critical path prediction has the potential to increase the effectiveness of these hardware optimizations by as much as 70%, without adding greatly to their cost.
Published: 2001-01-20
Citations: 147
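A PC-indexed table of saturating counters, trained by some heuristic that flags dynamic instructions as likely critical (for example, being the oldest unissued instruction), is one simple shape such a predictor can take. A sketch of that generic structure; the table size, threshold, and training heuristic are illustrative assumptions, not the paper's specific designs:

```python
class CriticalPathPredictor:
    """Toy PC-indexed critical-path predictor with saturating counters."""

    def __init__(self, size=4096, max_count=7, threshold=4):
        self.size = size
        self.max_count = max_count
        self.threshold = threshold
        self.table = [0] * size

    def _idx(self, pc):
        return pc % self.size

    def train(self, pc, flagged_critical):
        # Increment when the heuristic flags this dynamic instance as
        # critical; otherwise decay, so stale criticality fades out.
        i = self._idx(pc)
        if flagged_critical:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        elif self.table[i] > 0:
            self.table[i] -= 1

    def is_critical(self, pc):
        # Consumers (e.g. a value predictor or cluster steering logic)
        # spend their limited resources only on predicted-critical PCs.
        return self.table[self._idx(pc)] >= self.threshold
```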
A delay model and speculative architecture for pipelined routers
L. Peh, W. Dally
DOI: 10.1109/HPCA.2001.903268
Abstract: This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method employed, the delay of the flow control credit path, and the sharing of crossbar ports across virtual channels. Motivated by this model, we introduce a microarchitecture for a speculative virtual-channel router that significantly reduces its router latency to that of a wormhole router. Simulations using our pipelined model give results that differ considerably from the commonly assumed 'unit-latency' model, which is unreasonably optimistic. Using realistic pipeline models, we compare wormhole and virtual-channel flow control. Our results show that a speculative virtual-channel router has the same per-hop router latency as a wormhole router while improving throughput by up to 40%.
Published: 2001-01-20
Citations: 575
Differential FCM: increasing value prediction accuracy by improving table usage efficiency
B. Goeman, H. Vandierendonck, K. D. Bosschere
DOI: 10.1109/HPCA.2001.903264
Abstract: Value prediction is a relatively new technique to increase the instruction-level parallelism (ILP) in future microprocessors. An important problem when designing a value predictor is efficiency: an accurate predictor requires huge prediction tables. This is especially the case for the finite context method (FCM) predictor, the most accurate one. In this paper, we show that the prediction accuracy of the FCM can be greatly improved by making the FCM predict strides instead of values. This new predictor is called the differential finite context method (DFCM) predictor. The DFCM predictor outperforms a similar FCM predictor by as much as 33%, depending on the prediction table size. If we take the additional storage into account, the difference is still 15% for realistic predictor sizes. We use several metrics to show that the key to this success is reduced aliasing in the level-2 table. We also show that the DFCM is superior to hybrid predictors based on FCM and stride predictors, since its prediction accuracy is higher than that of a hybrid one using a perfect meta-predictor.
Published: 2001-01-20
Citations: 142
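The abstract's central idea, an FCM that predicts strides rather than values, can be sketched directly: a level-1 table keeps each instruction's last value and recent stride history, a level-2 table maps a hash of that history to the next stride, and the prediction is last value plus predicted stride. Table sizes, the hash, and the use of dictionaries are illustrative simplifications:

```python
class DFCMPredictor:
    """Toy differential finite context method (DFCM) value predictor."""

    def __init__(self, history_len=4, l2_size=1024):
        self.history_len = history_len
        self.l2_size = l2_size
        self.l1 = {}   # pc -> (last_value, tuple of recent strides)
        self.l2 = {}   # hashed stride history -> predicted next stride

    def _index(self, history):
        return hash(history) % self.l2_size

    def predict(self, pc):
        if pc not in self.l1:
            return None
        last_value, history = self.l1[pc]
        stride = self.l2.get(self._index(history), 0)
        return last_value + stride

    def update(self, pc, value):
        # On commit of the actual value: train the level-2 table with the
        # observed stride, then shift it into the level-1 history.
        if pc in self.l1:
            last_value, history = self.l1[pc]
            stride = value - last_value
            self.l2[self._index(history)] = stride
            history = (history + (stride,))[-self.history_len:]
        else:
            history = ()
        self.l1[pc] = (value, history)
```

Because many different value sequences share the same stride pattern, level-2 entries are reused across them, which is the reduced-aliasing effect the abstract credits for the accuracy gain.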
Stack value file: custom microarchitecture for the stack
H. Lee, M. Smelyanskiy, C. Newburn, G. Tyson
DOI: 10.1109/HPCA.2001.903247
Abstract: As processor performance increases, there is a corresponding increase in the demands on the memory system, including caches. Research papers have proposed partitioning the cache into instruction/data, temporal/non-temporal, and/or stack/non-stack regions. Each of these designs can improve performance by constructing two separate structures which can be probed in parallel while reducing contention. In this paper, we propose a new memory organization that partitions data references into stack and non-stack regions. Non-stack references are routed to a conventional cache. Stack references, on the other hand, are shown to have several characteristics that can be leveraged to improve performance using a less conventional storage organization. This paper enumerates those characteristics and proposes a new microarchitectural feature, the stack value file (SVF), which exploits them to improve instruction-level parallelism, reduce stack access latencies, reduce demand on the first-level cache, and reduce data bus traffic. Our results show that the SVF can improve execution performance by 29 to 65% while reducing overhead traffic for the stack region by many orders of magnitude over cache structures of the same size.
Published: 2001-01-20
Citations: 59
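The routing idea in the abstract, sending stack-region references to a small register-file-like structure and everything else to the conventional cache, can be sketched as an address-range check. The region bounds, indexing by offset from the stack base, and the dictionary-backed stores are all illustrative stand-ins, not the paper's microarchitecture:

```python
class StackValueFile:
    """Toy model of SVF-style routing of data references by address."""

    def __init__(self, stack_base, size_bytes=2048):
        self.base = stack_base       # stack grows down from this address
        self.size = size_bytes
        self.entries = {}            # offset from base -> value (the SVF)
        self.cache = {}              # stand-in for the conventional L1 data cache

    def _in_stack(self, addr):
        return self.base - self.size <= addr < self.base

    def store(self, addr, value):
        if self._in_stack(addr):
            self.entries[self.base - addr] = value   # low-latency SVF path
        else:
            self.cache[addr] = value                 # normal cache path

    def load(self, addr):
        if self._in_stack(addr):
            return self.entries.get(self.base - addr, 0)
        return self.cache.get(addr, 0)
```

Indexing by offset from the stack base mirrors why stack references suit a register-file-like store: they are small, dense, and computed relative to one pointer.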
Quantifying the impact of architectural scaling on communication
Taliver Heath, Samian Kaur, R. Martin, Thu D. Nguyen
DOI: 10.1109/HPCA.2001.903269
Abstract: This work quantifies how persistent increases in processor speed relative to I/O speed reduce the performance gap between specialized, high-performance messaging layers and general-purpose protocols such as TCP/IP and UDP/IP. The comparison is important because specialized layers sacrifice considerable system connectivity and robustness to obtain increased performance. We first quantify the scaling effects on small messages by measuring the LogP performance of two Active Message II layers, one running over a specialized VIA layer and the other over stock UDP, as we scale the CPU and I/O components. We then predict future LogP performance by mapping the LogP model's network parameters, particularly overhead, onto architectural components. Our projections show that the performance benefit afforded by specialized messaging for small messages will erode to a factor of 2 in the next 5 years. Our models further show that the performance differential between the two approaches will continue to erode without a radical restructuring of the I/O system. For long messages, we quantify the variable per-page instruction budget that a zero-copy messaging approach has for page table manipulations if it is to outperform a single-copy approach. Finally, we conclude with an examination of future I/O advances that would result in substantial improvements to messaging performance.
Published: 2001-01-20
Citations: 4
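For readers unfamiliar with the LogP model used in the study: a small message costs send overhead plus network latency plus receive overhead, and a sender can inject at most one message per gap g. A minimal sketch of those two standard cost formulas; the parameter values in the test are illustrative, not measurements from the paper:

```python
def logp_small_message_time(L, o_send, o_recv):
    """One-way time for a single small message under LogP:
    send overhead, then network latency L, then receive overhead."""
    return o_send + L + o_recv

def logp_n_message_time(n, L, o, g):
    """Time until the last of n back-to-back small messages is received,
    assuming symmetric per-message overhead o and injection gap g."""
    # The sender can start a new message every max(o, g) cycles; the last
    # one then pays the full o + L + o pipeline.
    return (n - 1) * max(o, g) + o + L + o
```

Mapping the overhead term o onto architectural components (CPU cycles spent in the protocol stack) is what lets the paper project how CPU scaling erodes the gap between specialized and general-purpose layers.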
Call graph prefetching for database applications
M. Annavaram, J. Patel, E. Davidson
DOI: 10.1109/HPCA.2001.903270
Abstract: With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper we propose Call Graph Prefetching (CGP), a hardware technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of I-cache misses was very small even without any prefetching, obviating the need for CGP. Our database experiments show that CGP reduces I-cache misses by 83% and can improve the performance of a database system by 30% over a baseline system that uses the OM tool to lay out the code so as to improve I-cache performance. CGP also achieved 7% higher performance than OM with next-N-line prefetching on database applications.
Published: 2001-01-20
Citations: 58
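The mechanism described above can be sketched as a call graph history table: record which callee followed each function last time, and on re-entering a function, prefetch the predicted callee's instructions. The table keying on function identifiers and the single-successor training rule are illustrative simplifications of the hardware:

```python
class CallGraphPrefetcher:
    """Toy call graph prefetching (CGP) model."""

    def __init__(self):
        self.next_callee = {}   # caller -> most recently observed callee
        self.prefetched = []    # trace of issued prefetch targets

    def on_function_entry(self, func):
        # On entering a function, prefetch the callee the table predicts
        # will be invoked next, hiding its I-cache miss latency.
        target = self.next_callee.get(func)
        if target is not None:
            self.prefetched.append(target)

    def on_call(self, caller, callee):
        # Train: remember the observed call sequence for next time.
        self.next_callee[caller] = callee
```

This works precisely because, as the abstract notes, database call sequences are highly repetitive: the same query operators invoke the same helpers in the same order across millions of tuples.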
DLP+TLP processors for the next generation of media workloads
J. Corbal, R. Espasa, M. Valero
DOI: 10.1109/HPCA.2001.903265
Abstract: Future media workloads will require about two orders of magnitude more performance than current general-purpose processors achieve. High uni-threaded performance will be needed to meet real-time constraints, together with huge computational throughput, as the next generation of media workloads will be eminently multithreaded (MPEG-4/MPEG-7). In order to meet the challenge of providing both good uni-threaded performance and throughput, we propose to join the simultaneous multithreading (SMT) execution paradigm with the ability to execute media-oriented streaming μ-SIMD instructions. This paper evaluates the performance of two different aggressive SMT processors: one with conventional μ-SIMD extensions (such as MMX) and one with longer streaming vector μ-SIMD extensions. We show that future media workloads are, in fact, dominated by scalar performance. The combination of SMT plus streaming vector μ-SIMD helps alleviate the performance bottleneck of the integer unit. SMT allows "hiding" vector execution underneath integer execution by overlapping the two types of computation, while the streaming vector μ-SIMD extensions reduce the pressure on issue width and fetch bandwidth, and provide a powerful mechanism to tolerate latency that makes it possible to implement smart decoupled cache hierarchies.
Published: 2001-01-20
Citations: 13