Proceedings Eighth International Symposium on High Performance Computer Architecture: latest articles, page 2

Tuning garbage collection in an embedded Java environment

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995701

Guangyu Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, M. Wolczko

Citations: 54

Modeling value speculation

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995711

Yiannakis Sazeides

Abstract: Several studies of speculative execution based on values have reported promising performance potential. However, virtually all microarchitectures in these studies were described in an ambiguous manner, mainly due to the lack of a formalization that defines the effects of value-speculation on a microarchitecture. In particular, the manifestations of value-speculation on the latency of microarchitectural operations, such as releasing resources and reissuing, were at best partially addressed. This may be problematic, since the results obtained in these studies can be difficult to reproduce and their contribution difficult to appreciate. This paper introduces a model for a methodical description of dynamically-scheduled microarchitectures that use value-speculation. The model isolates the parts of a microarchitecture that may be influenced by value-speculation in terms of various variables and latency events. This provides a systematic means for describing, evaluating and comparing the performance of value-speculative microarchitectures. The model parameters are integrated in a simulator to investigate the performance of several value-speculation related events. Among other things, the results show value-speculation performance to have non-uniform sensitivity to changes in the latency of these events. For example, fast verification latency is found to be essential, but when mis-speculation is infrequent, slow invalidation may be acceptable.

Citations: 9

Reverse Tracer: a software tool for generating realistic performance test programs

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995700

M. Sakamoto, Larry Brisson, A. Katsuno, Aiichiro Inoue, Yasunori Kimura

Abstract: During the development of high-performance processors, software performance models are used to obtain performance estimates. These models are not cycle-accurate, so their results can have significant errors, leading to performance surprises after the hardware is built. Some performance tests can run directly on the logic simulators, to get more accurate results, but those simulators cannot run large interactive workloads with I/O and much operating system code. So the accurate performance estimates from logic simulators are only available for application code, and are not adequate for the evaluation of powerful server systems that are primarily intended to run large interactive workloads. We discuss a software tool system, the "Reverse Tracer", that generates executable performance tests from an instruction trace of the workload. The generated performance tests retain the essential performance characteristics of multi-user I/O-intensive workloads without doing any real I/O, so they can run in logic simulation to measure performance accurately before the hardware is built.

Citations: 10

Using internal redundant representations and limited bypass to support pipelined adders and register files

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995718

Mary D. Brown, Y. Patt

Abstract: This paper evaluates the use of redundant binary and pipelined 2's complement adders in out-of-order execution cores. Redundant binary adders reduce the ADD latency to less than half that of traditional 2's complement adders, allowing higher core clock frequencies and greater execution bandwidth (in instructions per second). Pipelined 2's complement adders allow a higher clock frequency, but do not reduce the ADD latency. Machines with redundant binary adders are compared to machines with 2's complement adders and the same execution bandwidth and bypass network complexity. Results show that on the SPECint95 benchmarks, the average IPC of an 8-wide machine with 1-cycle redundant binary adders is 9% higher than a machine using 2-cycle pipelined adders. Pipelined functional units and multi-cycle register files may require multi-level bypass networks to guarantee that an instruction's result is available any cycle after it is produced. Multi-level bypass networks require large fan-in input muxes that increase cycle time. This paper shows that one level of bypass paths in a multi-level bypass network can be removed while still achieving within 3% to 1% of the IPC of a machine with a full bypass network.

Citations: 10

Quantifying load stream behavior

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995710

S. Sair, T. Sherwood, B. Calder

Abstract: The increasing performance gap between processors and memory will force future architectures to devote significant resources towards removing and hiding memory latency. The two major architectural features used to address this growing gap are caches and prefetching. In this paper we perform a detailed quantification of the cache miss patterns for the Olden benchmarks, SPEC 2000 benchmarks, and a collection of pointer-based applications. We classify misses into one of four categories corresponding to the type of access pattern. These are next-line, stride, same-object (additional misses that occur to a recently accessed object), or pointer-based transitions. We then propose and evaluate a hardware profiling architecture to correctly identify which type of access pattern is being seen. This access pattern identification could be used to help guide and allocate prefetching resources, and provide information to feedback-directed optimizations. A second goal of this paper is to identify a suite of challenging pointer-based benchmarks that can be used to focus the development of new software and hardware prefetching algorithms, and identify the challenges in performing prefetching for these applications using new metrics.
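The four miss categories named in this abstract (next-line, stride, same-object, pointer-based) can be illustrated with a small software sketch. This is not the paper's hardware profiling architecture: the function name `classify_miss`, the 64-byte line size, the order of the checks, and the fallback label `"other"` are all assumptions made only for illustration.

```python
from typing import Optional, Tuple

LINE_BYTES = 64  # assumed cache-line size; not taken from the paper


def classify_miss(addr: int,
                  prev_addr: int,
                  prev_stride: Optional[int] = None,
                  recent_object: Optional[Tuple[int, int]] = None,
                  last_loaded_value: Optional[int] = None) -> str:
    """Label one miss address relative to the previous miss (hypothetical helper)."""
    # next-line: the miss falls in the cache line right after the previous one
    if addr // LINE_BYTES == prev_addr // LINE_BYTES + 1:
        return "next-line"
    # stride: the distance to the previous miss repeats the last non-zero stride
    if prev_stride not in (None, 0) and addr - prev_addr == prev_stride:
        return "stride"
    # same-object: the miss lands inside a recently accessed object (base, size)
    if recent_object is not None:
        base, size = recent_object
        if base <= addr < base + size:
            return "same-object"
    # pointer: the miss address equals a value produced by an earlier load
    if last_loaded_value is not None and addr == last_loaded_value:
        return "pointer"
    return "other"  # assumed fallback; the abstract lists only four categories


if __name__ == "__main__":
    print(classify_miss(0x1040, 0x1000))
    print(classify_miss(0x2000, 0x1000, prev_stride=0x1000))
```

A real profiler would track this state per load instruction rather than per call; the sketch only shows how the four patterns differ.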

Citations: 22

Non-vital loads

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995707

R. Rakvic, B. Black, D. Limaye, John Paul Shen

Abstract: As the frequency gap between main memory and modern microprocessors grows, the implementation and efficiency of on-chip caches become more important. The growing latency to memory is motivating new research into load instruction behavior and selective data caching. This work investigates the classification of load instruction behavior. A new load classification method is proposed that classifies loads into those vital to performance and those not vital to performance. A limit study is presented to characterize different types of non-vital loads and to quantify the percentage of loads that are non-vital. Finally, a realistic implementation of the non-vital load classification method is presented and a new cache structure called the Vital Cache is proposed to take advantage of non-vital loads. The Vital Cache caches data for vital loads only, deferring non-vital loads to slower caches. Results: The limit study shows 75% of all loads are non-vital with only 35% of the accessed data space being vital for caching. The Vital Cache improves the efficiency of the cache hierarchy and the hit rate for vital loads. The Vital Cache increases performance by 17%.

Citations: 22

Evaluation of a multithreaded architecture for cellular computing

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995720

Calin Cascaval, J. Castaños, L. Ceze, Monty Denneau, Manish Gupta, D. Lieber, J. Moreira, K. Strauss, H. S. Warren

Abstract: Cyclops is a new architecture for high-performance parallel computers that is being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP (symmetric multiprocessor) system with multiple threads of execution, embedded memory and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper, we describe the Cyclops architecture and evaluate two of its new hardware features: a memory hierarchy with a flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40 GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high-performance systems.

Citations: 59

Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995709

P. Wang, Hong Wang, Jamison D. Collins, Edward T. Grochowski, R. Kling, John Paul Shen

Abstract: The performance of in-order execution Itanium™ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both OOO and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching for L1 misses, whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however, only SP is effective on delinquent loads found in loop control code.
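The three improvement figures quoted in this abstract (92% for SP, 87% for OOO, 141% combined) can be compared with two naive combination models to see why the approaches are described as partly redundant and partly complementary. The additive and multiplicative models below are illustrative assumptions, not an analysis from the paper.

```python
# Improvements over the in-order baseline, as quoted in the abstract.
sp = 0.92     # speculative precomputation alone
ooo = 0.87    # out-of-order execution alone
both = 1.41   # both techniques applied together

# Two naive estimates of what a fully independent combination would give
# (assumed models, used only as reference points).
additive = sp + ooo                          # gains simply add up
multiplicative = (1 + sp) * (1 + ooo) - 1    # speedups compound

print(f"additive estimate:       {additive:.2%}")
print(f"multiplicative estimate: {multiplicative:.2%}")
print(f"reported combined gain:  {both:.2%}")

# The reported gain exceeds either technique alone but stays below both
# estimates, which fits partial overlap in the loads each one covers.
partially_redundant = max(sp, ooo) < both < additive
```

Under these assumptions the combined result sits between the best single technique and the independent-gain estimates.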

Citations: 39

Bandwidth adaptive snooping

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995715

Milo M. K. Martin, Daniel J. Sorin, M. Hill, D. Wood

Citations: 67

Using complete machine simulation for software power estimation: the SoftWatt approach

Proceedings Eighth International Symposium on High Performance Computer Architecture Pub Date : 2002-02-02 DOI: 10.1109/HPCA.2002.995705

S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, N. Vijaykrishnan, M. Kandemir, Tao Li, L. John

Citations: 237