Slicing based code parallelization for minimizing inter-processor communication

M. Kandemir, Yuanrui Zhang, Sai Prashanth Muralidhara, O. Ozturk, S. Narayanan
{"title":"基于切片的代码并行化最小化处理器间通信","authors":"M. Kandemir, Yuanrui Zhang, Sai Prashanth Muralidhara, O. Ozturk, S. Narayanan","doi":"10.1145/1629395.1629409","DOIUrl":null,"url":null,"abstract":"One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Slicing based code parallelization for minimizing inter-processor communication\",\"authors\":\"M. Kandemir, Yuanrui Zhang, Sai Prashanth Muralidhara, O. Ozturk, S. Narayanan\",\"doi\":\"10.1145/1629395.1629409\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. 
The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested.\",\"PeriodicalId\":136293,\"journal\":{\"name\":\"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-10-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1629395.1629409\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1629395.1629409","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested.
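To make the slicing idea concrete, the following minimal Python sketch (not the authors' implementation) walks through the flow the abstract describes: start from a block partition of an output array, slice the iteration space of a single loop nest accordingly, propagate the slices to an input array, and count the elements that must be communicated between processors. The loop nest `B[i] = A[i] + A[i+1]`, the array names, the block partition of B, and the greedy ownership rule for A are all illustrative assumptions.

```python
# A minimal sketch of iteration space slicing for data-computation
# partitioning, assuming a single 1-D loop nest:
#   for i in range(N): B[i] = A[i] + A[i+1]
# Array names, the block partition of B, and the greedy ownership rule
# for A are assumptions made for illustration, not taken from the paper.

N = 16   # loop trip count / size of output array B (A has N + 1 elements)
P = 4    # number of processors

# Step 1: start from a block-wise partition of the output array B.
def block_partition(n, p):
    """Map each processor to the block of indices it owns."""
    size = (n + p - 1) // p
    return {proc: set(range(proc * size, min((proc + 1) * size, n)))
            for proc in range(p)}

b_partition = block_partition(N, P)

# Step 2: slice the iteration space. Iteration i writes B[i], so each
# processor's iteration slice is exactly the set of i whose B[i] it owns.
iter_slice = {proc: set(b_partition[proc]) for proc in range(P)}

# Step 3: propagate the slices to the input array A. Iteration i reads
# A[i] and A[i+1], so each processor references these elements of A:
a_needed = {proc: {i for it in iter_slice[proc] for i in (it, it + 1)}
            for proc in range(P)}

# Step 4: derive a partition of A from the slices. Here each element of A
# is greedily assigned to the lowest-numbered processor that references it;
# only elements referenced by more than one processor are communicated.
a_owner = {}
for proc in range(P):
    for idx in sorted(a_needed[proc]):
        a_owner.setdefault(idx, proc)

comm = sum(1 for proc in range(P)
           for idx in a_needed[proc] if a_owner[idx] != proc)

print("iteration slices:", {p_: sorted(s) for p_, s in iter_slice.items()})
print("elements of A fetched from another processor:", comm)
```

With these assumptions only the boundary elements of A cross processors (three elements for N = 16 and P = 4), which is the kind of inter-processor traffic the paper's partitioning scheme is designed to minimize across all arrays and loop nests of an application.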