Instruction combining for coalescing memory accesses using global code motion

Memory System Performance. Pub Date: 2004-06-08. DOI: 10.1145/1065895.1065897

M. Kawahito, H. Komatsu, T. Nakatani



Abstract

Instruction combining is an optimization that replaces a sequence of instructions with a more efficient instruction yielding the same result in fewer machine cycles. When used for coalescing memory accesses, it reduces memory traffic by combining narrow memory references to contiguous addresses into a single wider reference, taking advantage of a wide-bus architecture. Coalescing memory accesses can improve performance in two ways: it reduces the additional cycles required to move data from caches to registers, and it reduces the stall cycles caused by multiple outstanding memory access requests. Previous approaches to memory access coalescing focus only on array access instructions related to loop induction variables, and thus miss many other opportunities. In this paper, we propose a new algorithm for instruction combining that applies global code motion to wider regions of the given program in search of more potential candidates. Using our algorithm, we implemented two optimizations for coalescing memory accesses in the IBM Java™ JIT compiler for IA-64, one combining two 32-bit integer loads and the other combining two single-precision floating-point loads, and evaluated them on the SPECjvm98 benchmark suite. In our experiments, the optimizations improve performance by up to 5.5% with little additional compilation time overhead. Moreover, when we replace every declaration of double for an instance variable with float, performance improves by 7.3% for the MolDyn benchmark in the JavaGrande benchmark suite. Our approach can be applied to a variety of architectures and to programming languages other than Java.
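
The effect of combining two 32-bit integer loads can be illustrated at the source level. The following is only a minimal sketch, not the paper's algorithm or the JIT compiler's actual transformation: it uses java.nio.ByteBuffer as a stand-in for memory, and the class and method names (CoalesceSketch, loadPairNarrow, loadPairCoalesced) are hypothetical. It shows that one 64-bit load followed by register-only extraction yields the same two values as two narrow loads from contiguous addresses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class CoalesceSketch {
    // Baseline: two narrow 32-bit loads from contiguous addresses.
    static int[] loadPairNarrow(ByteBuffer buf, int offset) {
        int a = buf.getInt(offset);
        int b = buf.getInt(offset + 4);
        return new int[] { a, b };
    }

    // Coalesced: a single 64-bit load, then each half is extracted
    // with shifts and truncation, without touching memory again.
    static int[] loadPairCoalesced(ByteBuffer buf, int offset) {
        long wide = buf.getLong(offset);
        int a;
        int b;
        if (buf.order() == ByteOrder.BIG_ENDIAN) {
            a = (int) (wide >>> 32); // lower address holds the high half
            b = (int) wide;
        } else {
            a = (int) wide;          // lower address holds the low half
            b = (int) (wide >>> 32);
        }
        return new int[] { a, b };
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, 7).putInt(4, -3);
        int[] narrow = loadPairNarrow(buf, 0);
        int[] wide = loadPairCoalesced(buf, 0);
        System.out.println(Arrays.equals(narrow, wide));
    }
}
```

The single-precision floating-point case would follow the same pattern, reinterpreting each extracted 32-bit half with Float.intBitsToFloat. In a real compiler the transformation is only legal when the two loads are known to be adjacent and can be moved next to each other, which is the step the abstract attributes to global code motion.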

