Automatic Loop Tiling for Direct Memory Access

Haibo Lin, Tao Liu, Lakshminarayanan Renganarayanan, Huoding Li, Tong Chen, K. O'Brien, Ling Shao
{"title":"自动循环平铺直接内存访问","authors":"Haibo Lin, Tao Liu, Lakshminarayanan Renganarayanan, Huoding Li, Tong Chen, K. O'Brien, Ling Shao","doi":"10.1109/IPDPS.2011.53","DOIUrl":null,"url":null,"abstract":"In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware supported coherence and the software is responsible to dynamically transfer data between the fast local and slow global memory. The data can be transferred through either a software controlled cache or a direct buffer. The software controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. Direct buffer is efficient for regular accesses, while requiring precise analysis, detailed modeling of execution, and significant code generation. In this paper we present the design and implementation of {\\em DMATiler} which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data transfer friendly loop order and using a empirically validated DMA performance model, it formulates and solves a convex optimization problem to determine globally optimal tile sizes. Further, the DMATiler applies optimization techniques such compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC), and have conducted experiments with a set of loop nest benchmarks. The results show that the DMATiler is much more efficient than software controlled cache (average speedup of 9.8x) and single level loop blocking (average speedup of 6.2x) on the Cell BE processor.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Automatic Loop Tiling for Direct Memory Access\",\"authors\":\"Haibo Lin, Tao Liu, Lakshminarayanan Renganarayanan, Huoding Li, Tong Chen, K. O'Brien, Ling Shao\",\"doi\":\"10.1109/IPDPS.2011.53\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware supported coherence and the software is responsible to dynamically transfer data between the fast local and slow global memory. The data can be transferred through either a software controlled cache or a direct buffer. The software controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. Direct buffer is efficient for regular accesses, while requiring precise analysis, detailed modeling of execution, and significant code generation. In this paper we present the design and implementation of {\\\\em DMATiler} which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data transfer friendly loop order and using a empirically validated DMA performance model, it formulates and solves a convex optimization problem to determine globally optimal tile sizes. 
Further, the DMATiler applies optimization techniques such compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC), and have conducted experiments with a set of loop nest benchmarks. The results show that the DMATiler is much more efficient than software controlled cache (average speedup of 9.8x) and single level loop blocking (average speedup of 6.2x) on the Cell BE processor.\",\"PeriodicalId\":355100,\"journal\":{\"name\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE International Parallel & Distributed Processing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPS.2011.53\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE International Parallel & Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2011.53","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7

Abstract

In heterogeneous multi-core systems such as the Cell BE processor, each accelerator core has its own fast local memory without hardware-supported coherence, and software is responsible for dynamically transferring data between the fast local memory and the slow global memory. Data can be transferred through either a software-controlled cache or a direct buffer. A software-controlled cache maintains correctness for arbitrary access patterns but introduces the extra overhead of cache lookups. A direct buffer is efficient for regular accesses, but it requires precise analysis, detailed modeling of execution, and significant code generation. In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data-transfer-friendly loop order and, using an empirically validated DMA performance model, formulates and solves a convex optimization problem to determine globally optimal tile sizes. Further, the DMATiler applies optimization techniques such as compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC) and have conducted experiments with a set of loop-nest benchmarks. The results show that the DMATiler is much more efficient than a software-controlled cache (average speedup of 9.8x) and single-level loop blocking (average speedup of 6.2x) on the Cell BE processor.
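
To make the direct-buffer tiling idea concrete, the sketch below shows a 2-D loop nest tiled by hand so that each tile is staged through a small local buffer, computed on locally, and written back. This is a minimal illustration of the transformation the abstract describes, not DMATiler output: the array size, tile sizes, kernel, and the memcpy stand-in for a DMA transfer are assumptions for illustration; on a Cell SPE the copies would instead be issued as DMA commands between global memory and the local store.

/*
 * Illustrative sketch only (assumed sizes and kernel, not generated by DMATiler):
 * a 2-D loop nest tiled so each tile is staged through a small "local" buffer,
 * mimicking direct-buffer data movement between slow global and fast local memory.
 */
#include <string.h>

#define N   1024        /* problem size (assumed)   */
#define TI  64          /* tile size in i (assumed) */
#define TJ  128         /* tile size in j (assumed) */

static double A[N][N];              /* "global" memory array          */
static double local_buf[TI][TJ];    /* stand-in for fast local memory */

void scale_tiled(double alpha)
{
    for (int ii = 0; ii < N; ii += TI) {
        for (int jj = 0; jj < N; jj += TJ) {
            /* Stage the tile into the local buffer (DMA "get" on a real SPE). */
            for (int i = 0; i < TI; i++)
                memcpy(local_buf[i], &A[ii + i][jj], TJ * sizeof(double));

            /* Compute on the local copy only. */
            for (int i = 0; i < TI; i++)
                for (int j = 0; j < TJ; j++)
                    local_buf[i][j] *= alpha;

            /* Write the tile back (DMA "put" on a real SPE). */
            for (int i = 0; i < TI; i++)
                memcpy(&A[ii + i][jj], local_buf[i], TJ * sizeof(double));
        }
    }
}

Where a hand-written version fixes TI and TJ up front, DMATiler's contribution is to pick the data-transfer-friendly loop order and solve for the tile sizes automatically, using its empirically validated DMA performance model and the convex optimization formulation described in the abstract.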