Eighth Workshop on Interaction between Compilers and Computer Architectures, 2004 (INTERACT-8): Latest Articles

Fast indexing for blocked array layouts to improve multi-level cache locality
Evangelia Athanasaki, N. Koziris
DOI: 10.1109/INTERA.2004.1299515
Abstract: One of the key challenges computer architects and compiler writers are facing is the increasing discrepancy between processor cycle times and main memory access times. To overcome this problem, program transformations that decrease cache misses are used to reduce the average latency of memory accesses. Tiling is a widely used loop iteration reordering technique for improving locality of references. In this paper, we further reduce cache misses by restructuring the memory layout of multi-dimensional arrays that are accessed by tiled instruction code. In our method, array elements are stored in a blocked way, exactly as they are swept by the tiled instruction stream. We present a straightforward way to translate multi-dimensional indexing of arrays into their blocked memory layout using simple binary-mask operations. Indices for such array layouts are easily calculated based on the algebra of dilated integers, similarly to Morton-order indexing. Actual experimental results on three different hardware platforms, using 5 benchmarks, illustrate that execution time is greatly improved when combining tiled code with tiled array layouts and binary-mask-based index translation functions. Both TLB and L1 cache misses are concurrently minimized for the same tile size; thus, applying the proposed layouts, locality of references is greatly improved. Finally, simulations using the SimpleScalar tool verify that the enhanced performance is due to the considerable reduction of cache misses at all levels of the memory hierarchy.
Published: 2004-05-24
Citations: 13
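The binary-mask index translation the abstract describes can be illustrated with dilated (bit-interleaved) integers. The masks below are the standard 32-bit Morton-order dilation constants, and the function names are illustrative; the paper's exact mask scheme for arbitrary tile sizes may differ:

```python
# Sketch of Morton-order (Z-order) indexing via dilated integers.
# dilate() spreads the bits of a 16-bit coordinate apart so that two
# dilated coordinates can be combined with a shift and OR, turning a
# (row, col) pair into a single blocked-layout offset.

def dilate(x: int) -> int:
    """Insert a zero bit between consecutive bits of a 16-bit integer."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton_index(row: int, col: int) -> int:
    """Interleave the bits of row and col into a Morton (Z-order) offset."""
    return (dilate(row) << 1) | dilate(col)

EVEN = 0x55555555  # bit positions holding the column coordinate
ODD  = 0xAAAAAAAA  # bit positions holding the row coordinate

def inc_col(z: int) -> int:
    """Advance the Morton index to the next column using only mask ops,
    the 'algebra of dilated integers' increment: force the row bits to
    all-ones, add 1 so the carry ripples through them, then mask."""
    return (((z | ODD) + 1) & EVEN) | (z & ODD)
```

The `inc_col` trick is what makes sequential sweeps over a blocked layout cheap: no multiplication or division is needed to step to the neighboring element.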
Garbage collector refinement for new dynamic multimedia applications on embedded systems
J. M. Velasco, David Atienza Alonso, F. Catthoor, F. Tirado, Katzalin Olcoz, J. Mendias
DOI: 10.1109/INTERA.2004.1299507
Abstract: Consumer embedded devices must concurrently execute multiple services (e.g. multimedia applications) that are dynamically triggered by the user. For these new embedded multimedia applications, the dynamic memory subsystem is currently one of the main sources of power consumption, and its inattentive management can severely affect the performance and power consumption of the whole system. Therefore, suitable automatic mechanisms for reusing dynamic storage (i.e. garbage collector mechanisms) that take the underlying embedded devices into account would allow designers to design these systems more efficiently. However, methodologies to explore and implement convenient garbage collector mechanisms for embedded devices have not yet been developed. In this paper we propose a system-level method to define and explore the vast design space of possible garbage collector mechanisms, which makes it possible to define custom garbage collector implementations for the final embedded devices.
Published: 2004-05-24
Citations: 0
SimSnap: fast-forwarding via native execution and application-level checkpointing
P. Szwed, Daniel Marques, Robert M. Buels, S. Mckee, M. Schulz
DOI: 10.1109/INTERA.2004.1299511
Abstract: As systems become more complex, conducting cycle-accurate simulation experiments becomes more time consuming. Most approaches to accelerating simulation attempt to choose simulation points such that the performance of the program portions modeled in detail is representative of whole-program behavior. To maintain or build the correct architectural state, "fast-forwarding" models a series of instructions before a desired simulation point. This fast-forwarding is usually performed by functional simulation: modeling the effects of instructions without all the details of pipeline stages and individual μ-ops. We present another fast-forwarding technique, SimSnap, that leverages native execution and application-level checkpointing. We demonstrate the viability of our approach by moving checkpointed versions of SPLASH-2 benchmarks between an Alpha 21264 system and SimpleScalar Version 4.0 Alpha-Sim. The reduction in experiment times is dramatic, with minimal perturbation of the benchmark programs.
Published: 2004-05-24
Citations: 25
Cool-Fetch: a compiler-enabled IPC estimation based framework for energy reduction
O. Unsal, I. Koren, C. M. Krishna, C. A. Moritz
DOI: 10.1109/INTERA.2004.1299509
Abstract: With power consumption becoming an increasingly important factor, it is necessary to reevaluate traditional, power-intensive architectural techniques and their relative performance benefits. We believe that combined architecture-compiler efforts open up new and efficient ways to retain the performance benefits of modern architectures while addressing their power impact. In this paper, we present Cool-Fetch, an architecture-compiler approach to reducing energy consumption in the processor. While we mainly target the fetch unit, an important side effect of our approach is that we obtain energy savings in many other parts of the processor. The explanation is that the fetch unit often runs substantially ahead of execution, bringing instructions into different stages of the processor that may never be executed. We have found that although the degree of instruction-level parallelism (ILP) of a program tends to vary over time, it can be statically estimated by the compiler. Our instructions-per-clock (IPC) estimation scheme uses monotonic dataflow analysis and simple heuristics to guide a fetch-throttling mechanism. We develop the necessary architectural support and include its power overhead. Using Mediabench and SPEC2000 applications, we obtain up to 15% total energy savings in the processor with generally little performance degradation. We also provide a comparison of Cool-Fetch with previously proposed hardware-only dynamic fetch-throttling schemes.
Published: 2004-05-24
Citations: 13
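The abstract gives the shape of the scheme (a static IPC estimate per program region drives a fetch-throttling decision) but not its formulas. The estimator, threshold, and names below are invented purely for illustration and are not the paper's actual heuristics:

```python
# Illustrative sketch (not Cool-Fetch's actual analysis): a crude
# compiler-side IPC estimate drives a throttle-or-not decision per
# program region. All constants here are hypothetical.

def estimate_ipc(deps_per_instr: float, issue_width: int = 4) -> float:
    """Crude static IPC estimate: more dataflow dependences per
    instruction means less exploitable instruction-level parallelism."""
    return min(float(issue_width), issue_width / (1.0 + deps_per_instr))

def should_throttle(est_ipc: float, threshold: float = 1.5) -> bool:
    """Throttle fetch (e.g. gate it on alternate cycles) when the
    statically estimated IPC falls below the threshold: the fetch unit
    would only run further ahead of a slow execution stream."""
    return est_ipc < threshold
```

The design intuition matches the abstract: when estimated IPC is low, instructions fetched far ahead are likely to sit in (or be flushed from) the pipeline, so slowing fetch saves energy with little performance cost.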
Data movement optimization for software-controlled on-chip memory
M. Fujita, Masaaki Kondo, Hiroshi Nakamura
DOI: 10.1109/INTERA.2004.1299516
Abstract: To overcome the performance degradation caused by the disparity between processor and main memory speeds, several new VLSI architectures have been proposed that provide software-controlled on-chip memory in addition to the conventional cache. However, to achieve higher performance by utilizing such on-chip memory, users must specify data allocation/replacement on the software-controlled on-chip memory as well as data transfers between the on-chip and off-chip memories. Because these properties are controlled automatically by hardware in conventional caches, the cost of optimizing a program becomes a matter that should be considered. In this paper, we propose a data movement optimization technique for software-controlled on-chip memory. We evaluated the proposed method using two applications. The results reveal that the proposed technique can drastically reduce memory stall cycles and achieve high performance.
Published: 2004-05-24
Citations: 2
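The contrast with a hardware cache can be pictured as follows: with software-controlled on-chip memory, the program itself copies each tile of data into the fast memory before computing on it. The tile size, names, and the list-based "scratchpad" below are illustrative only; a real system would issue explicit DMA transfers, and the paper's optimization concerns scheduling those transfers well:

```python
# Illustrative sketch of software-managed on-chip memory: the program,
# not the hardware, stages each tile into a small fast buffer before
# computing. TILE and the helper name are hypothetical.

TILE = 4  # elements that fit in the (hypothetical) scratchpad

def process(off_chip: list[float]) -> float:
    """Sum of squares, computed tile by tile through a scratchpad."""
    total = 0.0
    for base in range(0, len(off_chip), TILE):
        scratchpad = off_chip[base:base + TILE]   # explicit tile copy in
        total += sum(x * x for x in scratchpad)   # compute on on-chip data
    return total
```

Because every copy is explicit, the compiler (or programmer) decides what is resident and when transfers happen; the optimization opportunity is overlapping or eliminating these copies, which a hardware cache would instead handle implicitly, and sometimes poorly.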
Exploitation of instruction-level parallelism for optimal loop scheduling
Jan Müller, D. Fimmel, R. Merker
DOI: 10.1109/INTERA.2004.1299506
Abstract: We present a loop scheduling approach that optimally exploits instruction-level parallelism. We develop a flow graph model of the resource constraints that allows a more efficient implementation. The method supports heterogeneous processor architectures and pipelined functional units. Our linear programming implementation produces an optimal loop schedule, making the technique applicable to production compilation and hardware parametrization. Compared to earlier approaches, our approach can provide faster loop schedules and a significant reduction in problem complexity and solution time.
Published: 2004-05-24
Citations: 3
Link-time optimization techniques for eliminating conditional branch redundancies
Manel Fernández, R. Espasa
DOI: 10.1109/INTERA.2004.1299513
Abstract: Optimizations performed at link time or applied directly to final program executables have received increased attention in recent years. This work discusses the discovery and elimination of redundant conditional branches in the context of a link-time optimizer, an optimization that we call conditional branch redundancy elimination (CBRE). Our experiments show that around 20% of the conditional branches in a program can be considered redundant because their outcomes can be determined from a previous short dynamic execution frame. We then present several CBRE algorithms targeted at optimizing away these redundancies. Our results show that around 5% of the detected conditional branch redundancy can indeed be eliminated, which translates into execution time reductions of around 4%. We also give accurate measurements of the impact of applying CBRE on code growth.
Published: 2004-05-24
Citations: 2
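A rough way to picture the 20% figure: scan a dynamic branch trace and count branches whose (address, outcome) pair already appeared among the few branches executed just before them, so their outcome was already determined. The frame length and trace format below are hypothetical, not the paper's actual measurement method:

```python
# Illustrative redundancy counter over a dynamic branch trace.
# Each trace entry is (branch_pc, taken). A branch is counted redundant
# when the same branch resolved with the same outcome within the last
# FRAME executed branches. FRAME is a hypothetical parameter.

FRAME = 4  # reach of the "short dynamic execution frame"

def count_redundant(trace: list[tuple[int, bool]]) -> int:
    redundant = 0
    for i, (pc, outcome) in enumerate(trace):
        frame = trace[max(0, i - FRAME):i]
        if (pc, outcome) in frame:   # outcome already determined earlier
            redundant += 1
    return redundant
```

A link-time optimizer has the whole executable in view, which is what lets it rewrite such correlated branch pairs (for example, by duplicating the short code path between them) where a per-module compiler could not.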
Reducing fetch architecture complexity using procedure inlining
Oliverio J. Santana, Alex Ramírez, M. Valero
DOI: 10.1109/INTERA.2004.1299514
Abstract: Fetch engine performance is seriously limited by the branch prediction table access latency. This fact has led to the development of hardware mechanisms, like prediction overriding, aimed at tolerating this latency. However, prediction overriding requires additional support and recovery mechanisms, which increases fetch architecture complexity. In this paper, we show that this increase in complexity can be avoided if the interaction between the fetch architecture and software code optimizations is taken into account. We use aggressive procedure inlining to generate long streams of instructions that the fetch engine uses as its basic prediction unit. We call a sequence of instructions from the target of a taken branch to the next taken branch an instruction stream. These instruction streams are long enough to feed the execution engine with instructions during multiple cycles while a new stream prediction is being generated, thus hiding the prediction table access latency. Our results show that the length of instruction streams compensates for the increase in the instruction cache miss rate caused by inlining. We show that, using procedure inlining, the need for a prediction overriding mechanism is avoided, reducing fetch engine complexity.
Published: 2004-05-24
Citations: 3
Dynamic management of nursery space organization in generational collection
J. M. Velasco, A. Ortiz, Katzalin Olcoz, F. Tirado
DOI: 10.1109/INTERA.2004.1299508
Abstract: The use of automatic memory management in object-oriented languages like Java is becoming widely accepted due to its software engineering benefits, its reduction in programming time, and safety aspects. Nevertheless, the complexity of garbage collection imposes an important overhead on the virtual machine. Until now, strategies in garbage collection have focused on defining and fixing regions in the heap based on different approaches and algorithms. Each of these strategies can beat the others depending on the data behavior of a specific application, but they fail to take advantage of the available resources in other cases; there is no static solution to this problem. In this paper, we present and evaluate two dynamic strategies, based on data lifetime, that reallocate at run time the reserved space in the nursery of generational Appel collectors. The dynamic tuning of the reserved space produces a drastic reduction in the number of collections and the total collection time, and has a clear effect on the final execution time.
Published: 2004-05-24
Citations: 6
Energy-efficiency potential of a phase-based cache resizing scheme for embedded systems
Gilles A. Pokam, F. Bodin
DOI: 10.1109/INTERA.2004.1299510
Abstract: Managing the energy-performance tradeoff has become a major challenge with embedded systems. The cache hierarchy is a typical example where this tradeoff plays a central role. With the increasing level of integration density, a cache can feature millions of transistors, consuming a significant portion of the energy; at the same time, however, a cache also significantly improves performance. Configurable caches are becoming the de facto solution for dealing with these issues efficiently. Such caches are equipped with mechanisms that enable one to resize them dynamically. With regard to embedded systems, however, many of these mechanisms restrict configurability to the application level. We propose in this paper to modify the structure of a configurable cache to offer embedded compilers the opportunity to reconfigure it according to a program's dynamic phases, rather than on a per-application basis. Our experimental results show that the proposed scheme has the potential to improve the compiler's effectiveness at reducing energy consumption, while not excessively degrading performance.
Published: 2004-05-24
Citations: 10