Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Boolean formula-based branch prediction for future technologies
Daniel A. Jiménez, H. Hanson, Calvin Lin
DOI: 10.1109/PACT.2001.953291 | Published: 2001-09-08
Abstract: We present a new method for branch prediction that encodes in the branch instruction a formula, chosen by profiling, that is used to perform history-based prediction. By using a special class of Boolean formulas, our encoding is extremely concise. By replacing the large tables found in current predictors with a small, fast circuit, our scheme is ideally suited to future technologies that will have large wire delays. In a projected 70 nm technology and an aggressive clock rate of about 5 GHz, an implementation of our method that uses an 8-bit formula encoding has a misprediction rate of 6.0%, 42% lower than that of the best gshare predictor implementable in that same technology. In today's technology, a 16-bit version of our predictor can replace bias bits in an 8K-entry agree predictor to achieve a 2.86% misprediction rate, which is slightly lower than the 2.93% misprediction rate of the Alpha 21264 hybrid predictor, even though the Alpha predictor has almost twice the hardware budget. Our predictor also consumes much less power than table-based predictors. The paper describes our predictor, explains our profiling algorithm, and presents experimental results using the SPEC 2000 integer benchmarks.
Citations: 12

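As a rough illustration of the idea (not the paper's circuit), the sketch below models a branch whose instruction carries a small, profiling-chosen encoding of a Boolean formula over recent branch outcomes. The formula here is restricted to a conjunction of possibly complemented history bits, and the 8-bit encoding layout is an assumption made for the example, not the paper's encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: an 8-bit encoding where the low nibble selects which
 * of the four most recent history bits participate in the formula, and the
 * high nibble says whether each selected bit is complemented.  The real
 * predictor uses a richer restricted class of Boolean formulas.            */

uint32_t ghist;                       /* bit i = outcome of i-th most recent branch */

bool predict(uint8_t formula)
{
    bool p = true;                    /* empty formula predicts taken */
    for (int i = 0; i < 4; i++) {
        if (!(formula & (1u << i)))
            continue;                 /* history bit i not used by this formula */
        bool h = (ghist >> i) & 1u;
        if (formula & (1u << (4 + i)))
            h = !h;                   /* complemented literal */
        p = p && h;                   /* conjunction of the selected literals */
    }
    return p;
}

void update(bool taken)
{
    ghist = (ghist << 1) | (taken ? 1u : 0u);   /* shift the outcome into history */
}
```

Because the formula is carried by the instruction and evaluated by a tiny circuit rather than looked up in a large table, prediction latency does not grow with wire delay the way table-based predictors do.
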
Data flow analysis for software prefetching linked data structures in Java
B. Cahoon, K. McKinley
DOI: 10.1109/PACT.2001.953309 | Published: 2001-09-08
Abstract: Describes an effective compile-time analysis for software prefetching in Java. Previous work in software data prefetching for pointer-based codes uses simple compiler algorithms and does not investigate prefetching for object-oriented language features that make compile-time analysis difficult. We develop a new data flow analysis to detect regular accesses to linked data structures in Java programs. We use intra- and inter-procedural analysis to identify profitable prefetching opportunities for greedy and jump-pointer prefetching, and we implement these techniques in a compiler for Java. Our results show that both prefetching techniques improve performance on four of our ten programs. The largest performance improvement is 48% with jump-pointers, but consistent improvements are difficult to obtain.
Citations: 158

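For readers unfamiliar with the two prefetching styles, the C sketch below shows hand-written versions of what the compiler inserts automatically (the paper does this for Java objects, not C structs). `__builtin_prefetch` is the GCC/Clang prefetch intrinsic, and the `jump` field stands for a jump pointer installed some fixed number of nodes ahead.

```c
#include <stddef.h>

struct node {
    int          payload;
    struct node *next;
    struct node *jump;    /* jump pointer: points several nodes ahead */
};

/* Greedy prefetching: fetch the immediately following node while the
 * current one is being processed.                                       */
long sum_greedy(struct node *n)
{
    long s = 0;
    while (n) {
        __builtin_prefetch(n->next);
        s += n->payload;
        n = n->next;
    }
    return s;
}

/* Jump-pointer prefetching: fetch a node far enough ahead to hide the
 * full miss latency, using the precomputed jump pointer.                */
long sum_jump(struct node *n)
{
    long s = 0;
    while (n) {
        if (n->jump)
            __builtin_prefetch(n->jump);
        s += n->payload;
        n = n->next;
    }
    return s;
}
```
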
Basic block distribution analysis to find periodic behavior and simulation points in applications
T. Sherwood, Erez Perelman, B. Calder
DOI: 10.1109/PACT.2001.953283 | Published: 2001-09-08
Abstract: Modern architecture research relies heavily on detailed pipeline simulation. Simulating the full execution of an industry standard benchmark can take weeks to months to complete. To overcome this problem, researchers choose a very small portion of a program's execution to evaluate their results, rather than simulating the entire program. In this paper we propose Basic Block Distribution Analysis as an automated approach for finding these small portions of the program to simulate that are representative of the entire program's execution. This approach is based upon using profiles of a program's code structure (basic blocks) to uniquely identify different phases of execution in the program. We show that the periodicity of the basic block frequency profile reflects the periodicity of detailed simulation across several different architectural metrics (e.g., IPC, branch miss rate, cache miss rate, value misprediction, address misprediction, and reorder buffer occupancy). Since basic block frequencies can be collected using very fast profiling tools, our approach provides a practical technique for finding the periodicity and simulation points in applications.
Citations: 609

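A minimal sketch of the comparison step, assuming basic block vectors (BBVs) of normalized execution frequencies and Manhattan distance as the similarity metric; the interval length, the fixed vector size, and the fuller periodicity analysis in the paper are not reproduced here.

```c
#include <math.h>
#include <stddef.h>

#define NBLOCKS 1024   /* illustrative number of static basic blocks */

/* Manhattan distance between two BBVs normalized to sum to 1:
 * 0 means an identical block mix, 2 means completely disjoint.        */
double bbv_distance(const double *a, const double *b)
{
    double d = 0.0;
    for (size_t i = 0; i < NBLOCKS; i++)
        d += fabs(a[i] - b[i]);
    return d;
}

/* Pick the interval whose block mix best matches the whole program;
 * simulating only that interval approximates full-program behavior.   */
size_t pick_simulation_point(const double (*interval_bbv)[NBLOCKS],
                             const double *program_bbv,
                             size_t nintervals)
{
    size_t best = 0;
    double best_d = 1e9;
    for (size_t i = 0; i < nintervals; i++) {
        double d = bbv_distance(interval_bbv[i], program_bbv);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```
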
Compiling for the Impulse memory controller
Xianglong Huang, Zhenlin Wang, K. McKinley
DOI: 10.1109/PACT.2001.953295 | Published: 2001-09-08
Abstract: The Impulse memory controller provides an interface for remapping irregular or sparse memory accesses into dense accesses in the cache memory. This capability significantly increases processor cache and system bus utilization, and previous work shows performance improvements from a factor of 1.2 to 5 with current technology models for hand-coded kernels in a cycle-level simulator. To attain widespread use of any specialized hardware feature requires automating its use in a compiler. We present compiler cost models using dependence and locality analysis that determine when to use Impulse to improve performance based on the reduction in misses, the additional cost for misses in Impulse, and the fixed cost for setting up a remapping. We implement the cost models and generate the appropriate Impulse system calls in the Scale compiler framework. Our results demonstrate that our cost models correctly choose when and when not to use Impulse. We also combine and compare Impulse with our implementation of loop permutation for improving locality. If loop permutation can achieve the same dense access pattern as Impulse, we prefer it, since it has no overheads, but we show that the combination can yield better performance.
Citations: 12

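The shape of the cost model's decision can be sketched as below, with every cost treated as a placeholder parameter rather than a value from the paper: remap through Impulse only when the cycles saved by eliminated misses exceed the extra latency of misses that go through the remapping controller plus the fixed cost of the setup system call.

```c
#include <stdbool.h>

struct impulse_costs {
    double miss_penalty;       /* cycles per ordinary cache miss             */
    double remapped_penalty;   /* cycles per miss serviced through Impulse   */
    double setup_cost;         /* one-time cost of the remapping system call */
};

/* Decide whether remapping pays off for a loop nest, given estimated miss
 * counts with and without Impulse (all inputs are placeholders).           */
bool use_impulse(double misses_before, double misses_after,
                 const struct impulse_costs *c)
{
    double saved = (misses_before - misses_after) * c->miss_penalty;
    double added = misses_after * (c->remapped_penalty - c->miss_penalty)
                 + c->setup_cost;
    return saved > added;
}
```
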
Code reordering and speculation support for dynamic optimization systems
E. Nystrom, R. D. Barnes, M. Merten, W. Hwu
DOI: 10.1109/PACT.2001.953297 | Published: 2001-09-08
Abstract: For dynamic optimization systems, success is limited by two difficult problems arising from instruction reordering. Following optimization within and across basic block boundaries, both the ordering of exceptions and the observed processor register contents at each exception point must be consistent with the original code. While compilers traditionally utilize global data flow analysis to determine which registers require preservation, this analysis is often infeasible in dynamic optimization systems due to both strict time/space constraints and incomplete code discovery. This paper presents an approach called precise speculation that addresses these problems. The proposed mechanism is a component of our vision for Run-time Optimization ARchitecture, or ROAR, to support aggressive dynamic optimization of programs. It utilizes a hardware mechanism to automatically recover the precise register states when a deferred exception is reported, utilizing the original unoptimized code to perform all recovery. We observe that precise speculation enables a dynamic optimization system to achieve a large performance gain over aggressively optimized base code, while preserving precise exceptions. For an 8-issue EPIC processor, the dynamic optimizer achieves between 3.6% and 57% speedup over a full-strength optimizing compiler that employs profile-guided optimization.
Citations: 12

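The recovery idea can be illustrated, very loosely, in software; the paper's mechanism is in hardware, and the poison-token representation below is only an analogy for how a hoisted load can defer its exception until the original exception point, where control falls back to the unoptimized code for precise state.

```c
#include <stdbool.h>
#include <stddef.h>

struct spec_val {
    long value;
    bool poisoned;                     /* deferred-exception token */
};

/* Stand-in for the original, unoptimized code path used only for recovery;
 * re-executing it reproduces precise register state and exception order.   */
long original_unoptimized_path(const long *addr)
{
    return *addr;                      /* the real, precise fault happens here */
}

/* A speculatively hoisted load records poison instead of faulting. */
struct spec_val spec_load(const long *addr)
{
    struct spec_val v = { 0, false };
    if (addr == NULL)                  /* stand-in for "this load would fault" */
        v.poisoned = true;
    else
        v.value = *addr;
    return v;
}

long optimized_region(const long *addr)
{
    struct spec_val v = spec_load(addr);   /* load hoisted by the optimizer */

    /* ... other reordered work that does not consume v ... */

    if (v.poisoned)                        /* original exception point */
        return original_unoptimized_path(addr);
    return v.value;
}
```
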
Filtering techniques to improve trace-cache efficiency
Roni Rosner, A. Mendelson, R. Ronen
DOI: 10.1109/PACT.2001.953286 | Published: 2001-09-08
Abstract: The trace cache is becoming an important building block of modern, wide-issue processors. The paper has three main contributions: it indicates that trace cache optimizations directed at reducing power consumption do not necessarily coincide with optimizations directed at increasing fetch bandwidth; it extends our understanding of how well the trace cache utilizes its resources; and it introduces a new trace-cache organization based on filtering techniques. We observe that: (1) the majority of traces that are inserted into the trace cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the few traces that are heavily and repeatedly used; and (3) techniques that aim to improve instruction fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace-cache (FTC) and the main trace-cache (MTC). The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, thus reducing power consumption while improving overall performance. An extension of the filtering concept involves adding a second-level (L2) trace cache that stores less frequent traces that are replaced in the FTC or the MTC. The extra level of caching allows for an order-of-magnitude reduction in the number of trace builds. The second-level trace cache proves particularly useful for applications with large instruction footprints.
Citations: 40

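A sketch of the filtering policy as described in the abstract: newly built traces are installed only in a small filter trace cache, and a trace is promoted to the main trace cache only if it is hit again before being evicted. The direct-mapped lookup and the table sizes below are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define FTC_SETS 64
#define MTC_SETS 512

struct trace_entry { uint64_t tag; bool valid; };

static struct trace_entry ftc[FTC_SETS];   /* filter trace cache */
static struct trace_entry mtc[MTC_SETS];   /* main trace cache   */

/* Returns true on a trace-cache hit; an FTC hit promotes the trace. */
bool trace_lookup(uint64_t start_pc)
{
    struct trace_entry *m = &mtc[start_pc % MTC_SETS];
    if (m->valid && m->tag == start_pc)
        return true;                           /* MTC hit */

    struct trace_entry *f = &ftc[start_pc % FTC_SETS];
    if (f->valid && f->tag == start_pc) {
        *m = *f;                               /* reuse observed: promote to MTC */
        f->valid = false;
        return true;                           /* FTC hit */
    }
    return false;                              /* miss: a new trace will be built */
}

/* Newly built traces go into the FTC only, so rarely reused traces
 * never displace hot traces in the MTC.                              */
void trace_fill(uint64_t start_pc)
{
    struct trace_entry *f = &ftc[start_pc % FTC_SETS];
    f->tag = start_pc;
    f->valid = true;
}
```
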
Multi-chain prefetching: effective exploitation of inter-chain memory parallelism for pointer-chasing codes
Nicholas Kohout, Seungryul Choi, Dongkeun Kim, D. Yeung
DOI: 10.1109/PACT.2001.953307 | Published: 2001-09-08
Abstract: Presents multi-chain prefetching, a technique that utilizes offline analysis and a hardware prefetch engine to prefetch multiple independent pointer chains simultaneously, thus exploiting inter-chain memory parallelism for the purpose of memory latency tolerance. This paper makes three contributions. First, we introduce a scheduling algorithm that identifies independent pointer chains in pointer-chasing codes and computes a prefetch schedule that overlaps serialized cache misses across separate chains. Our analysis focuses on static traversals. We also propose using speculation to identify independent pointer chains in dynamic traversals. Second, we present the design of a prefetch engine that traverses pointer-based data structures and overlaps multiple pointer chains according to our scheduling algorithm. Finally, we conduct an experimental evaluation of multi-chain prefetching and compare its performance against two existing techniques: jump pointer prefetching and prefetch arrays. Our results show that multi-chain prefetching improves the execution time by 40% across six pointer-chasing kernels from the Olden benchmark suite and by 8% across four SPECInt CPU2000 benchmarks. Multi-chain prefetching also outperforms jump pointer prefetching and prefetch arrays by 28% on Olden, and by 12% on SPECInt. Furthermore, speculation can enable multi-chain prefetching for some dynamic traversal codes, but our technique loses its effectiveness when the pointer-chain traversal order is unpredictable. Finally, we also show that combining multi-chain prefetching with prefetch arrays can potentially provide higher performance than either technique alone.
Citations: 32

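The inter-chain overlap can be illustrated with a hand-written software analogy (the paper uses an offline schedule driving a hardware prefetch engine): two independent chains are walked in lockstep so the serialized misses of one chain overlap with prefetches and work on the other. `__builtin_prefetch` is again the GCC/Clang intrinsic.

```c
#include <stddef.h>

struct node { long payload; struct node *next; };

/* Each chain's misses are serialized (the next address is unknown until the
 * current node arrives), but the two chains are independent, so their miss
 * latencies can be overlapped with each other.                             */
long sum_two_chains(struct node *a, struct node *b)
{
    long s = 0;
    while (a || b) {
        if (a) __builtin_prefetch(a->next);   /* start a's next miss early */
        if (b) __builtin_prefetch(b->next);   /* ...and b's, concurrently  */
        if (a) { s += a->payload; a = a->next; }
        if (b) { s += b->payload; b = b->next; }
    }
    return s;
}
```
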
Modeling superscalar processors via statistical simulation
Sébastien Nussbaum, James E. Smith
DOI: 10.1109/PACT.2001.953284 | Published: 2001-09-08
Abstract: Statistical simulation is a technique for fast performance evaluation of superscalar processors. First, intrinsic statistical information is collected from a single detailed simulation of a program. This information is then used to generate a synthetic instruction trace that is fed to a simple processor model, along with cache and branch prediction statistics. Because of the probabilistic nature of the simulation, it quickly converges to a performance rate. The simplicity and simulation speed make it useful for fast design space exploration; as such, it is a good complement to conventional detailed simulation. The accuracy of this technique is evaluated for different levels of modeling complexity. Both errors and convergence properties are studied in detail. A simple instruction model yields an average error of 8% compared with detailed simulation. A more detailed instruction model reduces the error to 5% but requires about three times as long to converge.
Citations: 203

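A toy version of the flow, with made-up probabilities and penalties standing in for the profiled statistics: draw miss events per synthetic instruction, charge them in a deliberately simple (single-issue) timing model, and stop once the IPC estimate stabilizes.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double frand(void) { return rand() / (double)RAND_MAX; }

int main(void)
{
    /* Placeholder statistics; in the technique these come from one detailed
     * profiling run of the target program.                                  */
    const double p_branch = 0.15, p_mispredict = 0.05;
    const double p_load = 0.30, p_cache_miss = 0.04;
    const double miss_penalty = 100.0, mispredict_penalty = 15.0;

    double cycles = 0.0, prev_ipc = 0.0;
    long   insts = 0;

    for (;;) {
        insts++;
        cycles += 1.0;                                   /* base cost per instruction */
        if (frand() < p_load && frand() < p_cache_miss)
            cycles += miss_penalty;                      /* injected cache miss */
        if (frand() < p_branch && frand() < p_mispredict)
            cycles += mispredict_penalty;                /* injected misprediction */

        if (insts % 100000 == 0) {                       /* periodic convergence check */
            double ipc = insts / cycles;
            if (fabs(ipc - prev_ipc) < 1e-4) {
                printf("converged: IPC ~ %.3f after %ld synthetic instructions\n",
                       ipc, insts);
                return 0;
            }
            prev_ipc = ipc;
        }
    }
}
```

Because the trace is synthetic and the model is probabilistic, the estimate converges after a few million instructions instead of the weeks a detailed simulation would take, which is what makes the approach useful for early design space exploration.
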
Exploring the design space of future CMPs
Jaehyuk Huh, D. Burger, S. Keckler
DOI: 10.1109/PACT.2001.953300 | Published: 2001-09-08
Abstract: We study the space of chip multiprocessor (CMP) organizations. We compare the area and performance trade-offs for CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should have in-order or out-of-order issue, and how big the per-processor on-chip caches should be. We find that, contrary to some conventional wisdom, out-of-order processing cores will maximize job throughput on future CMPs. As technology shrinks, limited off-chip bandwidth will begin to curtail the number of cores that can be effective on a single die. Current projections show that the transistor/signal pin ratio will increase by a factor of 45 between 180 and 35 nanometer technologies. That disparity will force increases in per-processor cache capacities as technology shrinks, from 128KB at 100nm, to 256KB at 70nm, and to 1MB at 50 and 35nm, reducing the number of cores that would otherwise be possible.
Citations: 197

Optimizing software data prefetches with rotating registers
Gautam Doshi, R. Krishnaiyer, Kalyan Muthukumar
DOI: 10.1109/PACT.2001.953306 | Published: 2001-09-08
Abstract: Software data prefetching is a well-known technique to improve the performance of programs that suffer many cache misses at several levels of the memory hierarchy. However, it has significant overhead in terms of increased code size, additional instructions, and possibly increased memory bus traffic due to redundant prefetches. This paper presents two novel methods to reduce the overhead of software data prefetching and improve program performance by optimized prefetch scheduling. These methods exploit the availability of rotating registers and predication in architectures such as the Itanium architecture. The methods (1) minimize redundant prefetches, (2) reduce the number of issue slots needed for prefetch instructions, and (3) avoid branch mispredict penalties, all with minimal code size increase. Compared to traditional data prefetching techniques, these methods (i) do not require loop unrolling, (ii) do not require predicate computations, and (iii) require fewer machine resources. One of these methods has been implemented in the Intel Production Compiler for the Itanium processor. This technique is compared with traditional approaches for software prefetching, and experimental results are presented based on the floating-point benchmark suite of CPU2000.
Citations: 33

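Without Itanium assembly, the effect can only be approximated in C: rather than prefetching every stream on every iteration (largely redundant, since one cache line covers several consecutive iterations), issue a single prefetch per iteration and rotate which stream it targets. On Itanium the rotation comes for free from rotating registers and predication with no unrolling; the modulo counter below is just an emulation, and the prefetch distance is an arbitrary placeholder.

```c
#include <stddef.h>

#define DIST 64                      /* prefetch distance in elements (placeholder) */

/* Stream triad with one rotating prefetch slot per iteration. */
void triad(double *a, const double *b, const double *c, size_t n)
{
    size_t which = 0;
    for (size_t i = 0; i < n; i++) {
        switch (which) {             /* one prefetch issue slot per iteration */
        case 0: __builtin_prefetch(&a[i + DIST], 1); break;   /* write hint */
        case 1: __builtin_prefetch(&b[i + DIST], 0); break;
        case 2: __builtin_prefetch(&c[i + DIST], 0); break;
        }
        which = (which + 1) % 3;     /* rotate among the three streams */
        a[i] = b[i] + 2.0 * c[i];
    }
}
```
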