Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Boolean formula-based branch prediction for future technologies
Daniel A. Jiménez, H. Hanson, Calvin Lin
DOI: 10.1109/PACT.2001.953291 | Published: 2001-09-08
Abstract: We present a new method for branch prediction that encodes in the branch instruction a formula, chosen by profiling, that is used to perform history-based prediction. By using a special class of Boolean formulas, our encoding is extremely concise. By replacing the large tables found in current predictors with a small, fast circuit, our scheme is ideally suited to future technologies that will have large wire delays. In a projected 70 nm technology and an aggressive clock rate of about 5 GHz, an implementation of our method that uses an 8-bit formula encoding has a misprediction rate of 6.0%, 42% lower than that of the best gshare predictor implementable in that same technology. In today's technology, a 16-bit version of our predictor can replace bias bits in an 8K-entry agree predictor to achieve a 2.86% misprediction rate, which is slightly lower than the 2.93% misprediction rate of the Alpha 21264 hybrid predictor, even though the Alpha predictor has almost twice the hardware budget. Our predictor also consumes much less power than table-based predictors. The paper describes our predictor, explains our profiling algorithm, and presents experimental results using the SPEC 2000 integer benchmarks.
Citations: 12

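As a rough illustration of the idea (not the paper's circuit), the sketch below models a branch whose instruction carries a small, profiling-chosen encoding of a Boolean formula over recent branch outcomes. The formula here is restricted to a conjunction of possibly complemented history bits, and the 8-bit encoding layout is an assumption made for the example, not the paper's encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: an 8-bit encoding where the low nibble selects which
 * of the four most recent history bits participate in the formula, and the
 * high nibble says whether each selected bit is complemented.  The real
 * predictor uses a richer restricted class of Boolean formulas.            */

uint32_t ghist;                       /* bit i = outcome of i-th most recent branch */

bool predict(uint8_t formula)
{
    bool p = true;                    /* empty formula predicts taken */
    for (int i = 0; i < 4; i++) {
        if (!(formula & (1u << i)))
            continue;                 /* history bit i not used by this formula */
        bool h = (ghist >> i) & 1u;
        if (formula & (1u << (4 + i)))
            h = !h;                   /* complemented literal */
        p = p && h;                   /* conjunction of the selected literals */
    }
    return p;
}

void update(bool taken)
{
    ghist = (ghist << 1) | (taken ? 1u : 0u);   /* shift the outcome into history */
}
```

Because the formula is carried by the instruction and evaluated by a tiny circuit rather than looked up in a large table, prediction latency does not grow with wire delay the way table-based predictors do.
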
Data flow analysis for software prefetching linked data structures in Java
B. Cahoon, K. McKinley
DOI: 10.1109/PACT.2001.953309 | Published: 2001-09-08
Abstract: Describes an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compile-time analysis difficult. We develop a new data flow analysis to detect regular accesses to linked data structures in Java programs. We use intra- and inter-procedural analysis to identify profitable prefetching opportunities for greedy and jump-pointer prefetching, and we implement these techniques in a compiler for Java. Our results show that both prefetching techniques improve performance on four of our ten programs. The largest performance improvement is 48% with jump-pointers, but consistent improvements are difficult to obtain.
Citations: 158

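For readers unfamiliar with the two prefetching styles, the C sketch below shows hand-written versions of what the compiler inserts automatically (the paper does this for Java objects, not C structs). `__builtin_prefetch` is the GCC/Clang prefetch intrinsic, and the `jump` field stands for a jump pointer installed some fixed number of nodes ahead.

```c
#include <stddef.h>

struct node {
    int          payload;
    struct node *next;
    struct node *jump;    /* jump pointer: points several nodes ahead */
};

/* Greedy prefetching: fetch the immediately following node while the
 * current one is being processed.                                       */
long sum_greedy(struct node *n)
{
    long s = 0;
    while (n) {
        __builtin_prefetch(n->next);
        s += n->payload;
        n = n->next;
    }
    return s;
}

/* Jump-pointer prefetching: fetch a node far enough ahead to hide the
 * full miss latency, using the precomputed jump pointer.                */
long sum_jump(struct node *n)
{
    long s = 0;
    while (n) {
        if (n->jump)
            __builtin_prefetch(n->jump);
        s += n->payload;
        n = n->next;
    }
    return s;
}
```
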
Basic block distribution analysis to find periodic behavior and simulation points in applications
T. Sherwood, Erez Perelman, B. Calder
DOI: 10.1109/PACT.2001.953283 | Published: 2001-09-08
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem, researchers choose a very small portion of a program's execution to evaluate their results, rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based upon using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.
Citations: 609

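A minimal sketch of the comparison step, assuming basic block vectors (BBVs) of normalized execution frequencies and Manhattan distance as the similarity metric; the interval length, the fixed vector size, and the fuller periodicity analysis in the paper are not reproduced here.

```c
#include <math.h>
#include <stddef.h>

#define NBLOCKS 1024   /* illustrative number of static basic blocks */

/* Manhattan distance between two BBVs normalized to sum to 1:
 * 0 means an identical block mix, 2 means completely disjoint.        */
double bbv_distance(const double *a, const double *b)
{
    double d = 0.0;
    for (size_t i = 0; i < NBLOCKS; i++)
        d += fabs(a[i] - b[i]);
    return d;
}

/* Pick the interval whose block mix best matches the whole program;
 * simulating only that interval approximates full-program behavior.   */
size_t pick_simulation_point(const double (*interval_bbv)[NBLOCKS],
                             const double *program_bbv,
                             size_t nintervals)
{
    size_t best = 0;
    double best_d = 1e9;
    for (size_t i = 0; i < nintervals; i++) {
        double d = bbv_distance(interval_bbv[i], program_bbv);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```
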
Compiling for the Impulse memory controller
Xianglong Huang, Zhenlin Wang, K. McKinley
DOI: 10.1109/PACT.2001.953295 | Published: 2001-09-08
Abstract: The Impulse memory controller provides an interface for remapping irregular or sparse memory accesses into dense accesses in the cache memory. This capability significantly increases processor cache and system bus utilization, and previous work shows performance improvements from a factor of 1.2 to 5 with current technology models for hand-coded kernels in a cycle-level simulator. To attain widespread use of any specialized hardware feature requires automating its use in a compiler. We present compiler cost models using dependence and locality analysis that determine when to use Impulse to improve performance based on the reduction in misses, the additional cost for misses in Impulse, and the fixed cost for setting up a remapping. We implement the cost models and generate the appropriate Impulse system calls in the Scale compiler framework. Our results demonstrate that our cost models correctly choose when and when not to use Impulse. We also combine and compare Impulse with our implementation of loop permutation for improving locality. If loop permutation can achieve the same dense access pattern as Impulse, we prefer it, since it has no overheads, but we show that the combination can yield better performance.
Citations: 12

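The shape of the cost model's decision can be sketched as below, with every cost treated as a placeholder parameter rather than a value from the paper: remap through Impulse only when the cycles saved by eliminated misses exceed the extra latency of misses that go through the remapping controller plus the fixed cost of the setup system call.

```c
#include <stdbool.h>

struct impulse_costs {
    double miss_penalty;       /* cycles per ordinary cache miss             */
    double remapped_penalty;   /* cycles per miss serviced through Impulse   */
    double setup_cost;         /* one-time cost of the remapping system call */
};

/* Decide whether remapping pays off for a loop nest, given estimated miss
 * counts with and without Impulse (all inputs are placeholders).           */
bool use_impulse(double misses_before, double misses_after,
                 const struct impulse_costs *c)
{
    double saved = (misses_before - misses_after) * c->miss_penalty;
    double added = misses_after * (c->remapped_penalty - c->miss_penalty)
                 + c->setup_cost;
    return saved > added;
}
```
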
Code reordering and speculation support for dynamic optimization systems
E. Nystrom, R. D. Barnes, M. Merten, W. Hwu
DOI: 10.1109/PACT.2001.953297 | Published: 2001-09-08
Abstract: For dynamic optimization systems, success is limited by two difficult problems arising from instruction reordering. Following optimization within and across basic block boundaries, both the ordering of exceptions and the observed processor register contents at each exception point must be consistent with the original code. While compilers traditionally utilize global data flow analysis to determine which registers require preservation, this analysis is often infeasible in dynamic optimization systems due to both strict time/space constraints and incomplete code discovery. This paper presents an approach called precise speculation that addresses these problems. The proposed mechanism is a component of our vision for Run-time Optimization ARchitecture, or ROAR, to support aggressive dynamic optimization of programs. It utilizes a hardware mechanism to automatically recover the precise register states when a deferred exception is reported, utilizing the original unoptimized code to perform all recovery. We observe that precise speculation enables a dynamic optimization system to achieve a large performance gain over aggressively optimized base code, while preserving precise exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves between 3.6% and 57% speedup over a full-strength optimizing compiler that employs profile-guided optimization.
Citations: 12

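The recovery idea can be illustrated, very loosely, in software; the paper's mechanism is in hardware, and the poison-token representation below is only an analogy for how a hoisted load can defer its exception until the original exception point, where control falls back to the unoptimized code for precise state.

```c
#include <stdbool.h>
#include <stddef.h>

struct spec_val {
    long value;
    bool poisoned;                     /* deferred-exception token */
};

/* Stand-in for the original, unoptimized code path used only for recovery;
 * re-executing it reproduces precise register state and exception order.   */
long original_unoptimized_path(const long *addr)
{
    return *addr;                      /* the real, precise fault happens here */
}

/* A speculatively hoisted load records poison instead of faulting. */
struct spec_val spec_load(const long *addr)
{
    struct spec_val v = { 0, false };
    if (addr == NULL)                  /* stand-in for "this load would fault" */
        v.poisoned = true;
    else
        v.value = *addr;
    return v;
}

long optimized_region(const long *addr)
{
    struct spec_val v = spec_load(addr);   /* load hoisted by the optimizer */

    /* ... other reordered work that does not consume v ... */

    if (v.poisoned)                        /* original exception point */
        return original_unoptimized_path(addr);
    return v.value;
}
```
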
Filtering techniques to improve trace-cache efficiency
Roni Rosner, A. Mendelson, R. Ronen
DOI: 10.1109/PACT.2001.953286 | Published: 2001-09-08
Abstract: The trace cache is becoming an important building block of modern, wide-issue processors. The paper has three main contributions: it indicates that trace cache optimizations directed at reducing power consumption do not necessarily coincide with optimizations directed at increasing fetch bandwidth; it extends our understanding of how well the trace cache utilizes its resources; and it introduces a new trace-cache organization based on filtering techniques. We observe that: (1) the majority of traces that are inserted into the trace cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the few traces that are heavily and repeatedly used; and (3) techniques that aim to improve instruction fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace-cache (FTC) and the main trace-cache (MTC). The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, thus reducing power consumption while improving overall performance. An extension of the filtering concept involves adding a second-level (L2) trace cache that stores less frequent traces that are replaced in the FTC or the MTC. The extra level of caching allows for an order-of-magnitude reduction in the number of trace builds. The second-level trace cache proves particularly useful for applications with large instruction footprints.
Citations: 40

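A sketch of the filtering policy as described in the abstract: newly built traces are installed only in a small filter trace cache, and a trace is promoted to the main trace cache only if it is hit again before being evicted. The direct-mapped lookup and the table sizes below are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define FTC_SETS 64
#define MTC_SETS 512

struct trace_entry { uint64_t tag; bool valid; };

static struct trace_entry ftc[FTC_SETS];   /* filter trace cache */
static struct trace_entry mtc[MTC_SETS];   /* main trace cache   */

/* Returns true on a trace-cache hit; an FTC hit promotes the trace. */
bool trace_lookup(uint64_t start_pc)
{
    struct trace_entry *m = &mtc[start_pc % MTC_SETS];
    if (m->valid && m->tag == start_pc)
        return true;                           /* MTC hit */

    struct trace_entry *f = &ftc[start_pc % FTC_SETS];
    if (f->valid && f->tag == start_pc) {
        *m = *f;                               /* reuse observed: promote to MTC */
        f->valid = false;
        return true;                           /* FTC hit */
    }
    return false;                              /* miss: a new trace will be built */
}

/* Newly built traces go into the FTC only, so rarely reused traces
 * never displace hot traces in the MTC.                              */
void trace_fill(uint64_t start_pc)
{
    struct trace_entry *f = &ftc[start_pc % FTC_SETS];
    f->tag = start_pc;
    f->valid = true;
}
```
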
Multi-chain prefetching: effective exploitation of inter-chain memory parallelism for pointer-chasing codes
Nicholas Kohout, Seungryul Choi, Dongkeun Kim, D. Yeung
DOI: 10.1109/PACT.2001.953307 | Published: 2001-09-08
Abstract: Presents multi-chain prefetching, a technique that utilizes offline analysis and a hardware prefetch engine to prefetch multiple independent pointer chains simultaneously, thus exploiting inter-chain memory parallelism for the purpose of memory latency tolerance. This paper makes three contributions. First, we introduce a scheduling algorithm that identifies independent pointer chains in pointer-chasing codes and computes a prefetch schedule that overlaps serialized cache misses across separate chains. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Second, we present the design of a prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to our scheduling algorithm. Finally, we conduct an experimental evaluation of multi-chain prefetching and compare its performance against two existing techniques: jump pointer prefetching and prefetch arrays. Our results show that multi-chain prefetching improves the execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite and by 8% across four SPECInt CPU2000 benchmarks. Multi-chain prefetching also outperforms jump pointer prefetching and prefetch arrays by 28% on Olden, and by 12% on SPECInt. Furthermore, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is unpredictable. Finally, we also show that combining multi-chain prefetching with prefetch arrays can potentially provide higher performance than either technique alone.
Citations: 32

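The inter-chain overlap can be illustrated with a hand-written software analogy (the paper uses an offline schedule driving a hardware prefetch engine): two independent chains are walked in lockstep so the serialized misses of one chain overlap with prefetches and work on the other. `__builtin_prefetch` is again the GCC/Clang intrinsic.

```c
#include <stddef.h>

struct node { long payload; struct node *next; };

/* Each chain's misses are serialized (the next address is unknown until the
 * current node arrives), but the two chains are independent, so their miss
 * latencies can be overlapped with each other.                             */
long sum_two_chains(struct node *a, struct node *b)
{
    long s = 0;
    while (a || b) {
        if (a) __builtin_prefetch(a->next);   /* start a's next miss early */
        if (b) __builtin_prefetch(b->next);   /* ...and b's, concurrently  */
        if (a) { s += a->payload; a = a->next; }
        if (b) { s += b->payload; b = b->next; }
    }
    return s;
}
```
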
Modeling superscalar processors via statistical simulation
Sébastien Nussbaum, James E. Smith
DOI: 10.1109/PACT.2001.953284 | Published: 2001-09-08
Abstract: Statistical simulation is a technique for fast performance evaluation of superscalar processors. First, intrinsic statistical information is collected from a single detailed simulation of a program. This information is then used to generate a synthetic instruction trace that is fed to a simple processor model, along with cache and branch prediction statistics. Because of the probabilistic nature of the simulation, it quickly converges to a performance rate. The simplicity and simulation speed make it useful for fast design space exploration; as such, it is a good complement to conventional detailed simulation. The accuracy of this technique is evaluated for different levels of modeling complexity. Both errors and convergence properties are studied in detail. A simple instruction model yields an average error of 8% compared with detailed simulation. A more detailed instruction model reduces the error to 5% but requires about three times as long to converge.
Citations: 203

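A toy version of the flow, with made-up probabilities and penalties standing in for the profiled statistics: draw miss events per synthetic instruction, charge them in a deliberately simple (single-issue) timing model, and stop once the IPC estimate stabilizes.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double frand(void) { return rand() / (double)RAND_MAX; }

int main(void)
{
    /* Placeholder statistics; in the technique these come from one detailed
     * profiling run of the target program.                                  */
    const double p_branch = 0.15, p_mispredict = 0.05;
    const double p_load = 0.30, p_cache_miss = 0.04;
    const double miss_penalty = 100.0, mispredict_penalty = 15.0;

    double cycles = 0.0, prev_ipc = 0.0;
    long   insts = 0;

    for (;;) {
        insts++;
        cycles += 1.0;                                   /* base cost per instruction */
        if (frand() < p_load && frand() < p_cache_miss)
            cycles += miss_penalty;                      /* injected cache miss */
        if (frand() < p_branch && frand() < p_mispredict)
            cycles += mispredict_penalty;                /* injected misprediction */

        if (insts % 100000 == 0) {                       /* periodic convergence check */
            double ipc = insts / cycles;
            if (fabs(ipc - prev_ipc) < 1e-4) {
                printf("converged: IPC ~ %.3f after %ld synthetic instructions\n",
                       ipc, insts);
                return 0;
            }
            prev_ipc = ipc;
        }
    }
}
```

Because the trace is synthetic and the model is probabilistic, the estimate converges after a few million instructions instead of the weeks a detailed simulation would take, which is what makes the approach useful for early design space exploration.
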
Exploring the design space of future CMPs
Jaehyuk Huh, D. Burger, S. Keckler
DOI: 10.1109/PACT.2001.953300 | Published: 2001-09-08
Abstract: We study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.
Citations: 197

Optimizing software data prefetches with rotating registers
Gautam Doshi, R. Krishnaiyer, Kalyan Muthukumar
DOI: 10.1109/PACT.2001.953306 | Published: 2001-09-08
Abstract: Software data prefetching is a well-known technique to improve the performance of programs that suffer many cache misses at several levels of the memory hierarchy. However, it has significant overhead in terms of increased code size, additional instructions, and possibly increased memory bus traffic due to redundant prefetches. This paper presents two novel methods to reduce the overhead of software data prefetching and improve program performance by optimized prefetch scheduling. These methods exploit the availability of rotating registers and predication in architectures such as the Itanium architecture. The methods (1) minimize redundant prefetches, (2) reduce the number of issue slots needed for prefetch instructions, and (3) avoid branch mispredict penalties, all with minimal code size increase. Compared to traditional data prefetching techniques, these methods (i) do not require loop unrolling, (ii) do not require predicate computations, and (iii) require fewer machine resources. One of these methods has been implemented in the Intel Production Compiler for the Itanium processor. This technique is compared with traditional approaches for software prefetching, and experimental results are presented based on the floating-point benchmark suite of CPU2000.
Citations: 33

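Without Itanium assembly, the effect can only be approximated in C: rather than prefetching every stream on every iteration (largely redundant, since one cache line covers several consecutive iterations), issue a single prefetch per iteration and rotate which stream it targets. On Itanium the rotation comes for free from rotating registers and predication with no unrolling; the modulo counter below is just an emulation, and the prefetch distance is an arbitrary placeholder.

```c
#include <stddef.h>

#define DIST 64                      /* prefetch distance in elements (placeholder) */

/* Stream triad with one rotating prefetch slot per iteration. */
void triad(double *a, const double *b, const double *c, size_t n)
{
    size_t which = 0;
    for (size_t i = 0; i < n; i++) {
        switch (which) {             /* one prefetch issue slot per iteration */
        case 0: __builtin_prefetch(&a[i + DIST], 1); break;   /* write hint */
        case 1: __builtin_prefetch(&b[i + DIST], 0); break;
        case 2: __builtin_prefetch(&c[i + DIST], 0); break;
        }
        which = (which + 1) % 3;     /* rotate among the three streams */
        a[i] = b[i] + 2.0 * c[i];
    }
}
```
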