Symbolic loop parallelization for balancing I/O and memory accesses on processor arrays

2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE). Pub Date: 2015-12-03. DOI: 10.1109/MEMCOD.2015.7340486

Alexandru Tanase, Michael Witterauf, J. Teich, Frank Hannig


Citations: 4

Abstract

Loop parallelization techniques for massively parallel processor arrays that use one-level tiling are often either I/O-bound or memory-bound, exceeding the capabilities of the target architecture. Furthermore, if the number of available processing elements is known only at runtime, as in adaptive systems, static approaches fail. To solve these problems, we present a hybrid compile/runtime technique to symbolically parallelize loop nests with uniform dependences on multiple levels. At compile time, two novel transformations are performed: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. Tuning the tile sizes on multiple levels allows a trade-off between the required I/O bandwidth and memory, which makes it easier to satisfy resource constraints. The resulting schedules are symbolic with respect to the number of tiles; thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when this number is known, a simple prolog chooses a schedule that is feasible with respect to I/O and memory constraints and latency-optimal for the chosen tile size. In this way, our approach dynamically chooses latency-optimal and feasible schedules while avoiding expensive recompilation.
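The runtime step described in the abstract can be pictured with a small Python sketch. It is not the authors' implementation: the `Candidate` record, the `select_schedule` function, the trip count `N`, and all three cost formulas are hypothetical and invented only to make the trade-off visible. The sketch assumes that compile time leaves behind a handful of schedules whose latency, per-PE memory need, and I/O need are formulas in the number of processing elements; once that number is known, the selection filters out infeasible candidates and keeps the one with the lowest latency.

```python
from dataclasses import dataclass
from math import ceil
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A precomputed schedule whose costs are formulas in the PE count."""
    name: str
    latency: Callable[[int], int]        # cycles until the loop nest finishes
    memory_per_pe: Callable[[int], int]  # words of local storage per PE
    io_per_cycle: Callable[[int], int]   # words crossing the array border per cycle


N = 1024  # hypothetical loop trip count

# Toy formulas (assumptions, not taken from the paper).
CANDIDATES = (
    # Large tiles: each PE buffers its whole slice, so little I/O is needed.
    Candidate("big-tiles",
              lambda p: ceil(N / p), lambda p: ceil(N / p), lambda p: 1),
    # Small tiles: little local memory, but every PE streams data.
    Candidate("small-tiles",
              lambda p: ceil(N / p) + p, lambda p: 4, lambda p: p),
    # Two tiling levels: a compromise between the two extremes.
    Candidate("two-level",
              lambda p: ceil(N / p) + 16, lambda p: 16, lambda p: ceil(p / 4)),
)


def select_schedule(num_pes: int, mem_limit: int, io_limit: int,
                    candidates: Sequence[Candidate] = CANDIDATES
                    ) -> Optional[Candidate]:
    """Return the lowest-latency feasible candidate, or None if none fits."""
    feasible = [
        c for c in candidates
        if c.memory_per_pe(num_pes) <= mem_limit
        and c.io_per_cycle(num_pes) <= io_limit
    ]
    # Evaluate the symbolic latency only now that num_pes is known.
    return min(feasible, key=lambda c: c.latency(num_pes), default=None)
```

With these toy formulas, `select_schedule(16, 32, 8)` rejects the large-tile candidate for exceeding the memory limit and the small-tile candidate for exceeding the I/O limit, so the two-level candidate is selected. Raising the memory limit to 64 makes the large-tile candidate feasible, and it then wins on latency. No recompilation is involved in either case, since only the already available formulas are evaluated.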
