用于平衡处理器阵列上的I/O和内存访问的符号循环并行化

Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig
{"title":"用于平衡处理器阵列上的I/O和内存访问的符号循环并行化","authors":"Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig","doi":"10.1109/MEMCOD.2015.7340486","DOIUrl":null,"url":null,"abstract":"Loop parallelization techniques for massively parallel processor arrays using one-level tiling are often either I/O- or memory-bounded, exceeding the target architecture's capabilities. Furthermore, if the number of available processing elements is only known at runtime - as in adaptive systems - static approaches fail. To solve these problems, we present a hybrid compile/runtime technique to symbolically parallelize loop nests with uniform dependences on multiple levels. At compile time, two novel transformations are performed: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. By tuning the size of the tiles on multiple levels, a trade-off between the necessary I/O-bandwidth and memory is possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the number of tiles; thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when the number is known, a simple prolog chooses a feasible schedule with respect to I/O and memory constraints that is latency-optimal for the chosen tile size. In this way, our approach dynamically chooses latency-optimal and feasible schedules while avoiding expensive re-compilations.","PeriodicalId":106851,"journal":{"name":"2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Symbolic loop parallelization for balancing I/O and memory accesses on processor arrays\",\"authors\":\"Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig\",\"doi\":\"10.1109/MEMCOD.2015.7340486\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Loop parallelization techniques for massively parallel processor arrays using one-level tiling are often either I/O- or memory-bounded, exceeding the target architecture's capabilities. Furthermore, if the number of available processing elements is only known at runtime - as in adaptive systems - static approaches fail. To solve these problems, we present a hybrid compile/runtime technique to symbolically parallelize loop nests with uniform dependences on multiple levels. At compile time, two novel transformations are performed: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. By tuning the size of the tiles on multiple levels, a trade-off between the necessary I/O-bandwidth and memory is possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the number of tiles; thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when the number is known, a simple prolog chooses a feasible schedule with respect to I/O and memory constraints that is latency-optimal for the chosen tile size. In this way, our approach dynamically chooses latency-optimal and feasible schedules while avoiding expensive re-compilations.\",\"PeriodicalId\":106851,\"journal\":{\"name\":\"2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE)\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MEMCOD.2015.7340486\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MEMCOD.2015.7340486","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

摘要

使用一级平铺的大规模并行处理器阵列的循环并行化技术通常受到I/O或内存限制,超出了目标体系结构的能力。此外,如果可用处理元素的数量仅在运行时才知道——就像在自适应系统中一样——静态方法就会失败。为了解决这些问题,我们提出了一种混合编译/运行技术,以符号并行化在多个级别上具有统一依赖关系的循环巢。在编译时,执行两个新的转换:(a)符号分层平铺,然后(b)符号多级调度。通过在多个级别上调整磁贴的大小,可以在必要的I/ o带宽和内存之间进行权衡,这有助于遵守资源约束。由此产生的时间表对于瓷砖的数量是象征性的;因此,在编译时不需要知道要映射到的处理元素的数量。在运行时,当数目已知时,一个简单的prolog根据I/O和内存约束选择一个可行的调度,该调度对于所选的磁贴大小来说是延迟最优的。通过这种方式,我们的方法动态地选择延迟最优和可行的调度,同时避免昂贵的重新编译。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Symbolic loop parallelization for balancing I/O and memory accesses on processor arrays
Loop parallelization techniques for massively parallel processor arrays using one-level tiling are often either I/O- or memory-bounded, exceeding the target architecture's capabilities. Furthermore, if the number of available processing elements is only known at runtime - as in adaptive systems - static approaches fail. To solve these problems, we present a hybrid compile/runtime technique to symbolically parallelize loop nests with uniform dependences on multiple levels. At compile time, two novel transformations are performed: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. By tuning the size of the tiles on multiple levels, a trade-off between the necessary I/O-bandwidth and memory is possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the number of tiles; thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when the number is known, a simple prolog chooses a feasible schedule with respect to I/O and memory constraints that is latency-optimal for the chosen tile size. In this way, our approach dynamically chooses latency-optimal and feasible schedules while avoiding expensive re-compilations.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信