Acceleration of Communication-Aware Task Mapping Techniques through GPU Computing

2013 27th International Conference on Advanced Information Networking and Applications Workshops. Pub Date: 2013-03-25. DOI: 10.1109/WAINA.2013.38

Javier Reyes, J. Orduña, G. Vigueras, Rafael Tornero


Citations: 1

Abstract

The era of distributed computing, in which applications run on platforms such as clusters, grids, and clouds of computers, has shown the need to take into account the communications that take place on distributed computer architectures when executing applications. In that environment, different communication-aware mapping techniques have been proposed for improving system performance, both for off-chip and for on-chip networks. Some of these proposals rely on heuristic search to find pseudo-optimal assignments of a given population of tasks to processing elements. Technology improvements have multiplied the number of processor cores in each chip, significantly increasing the problem size. Heuristic-search proposals must therefore be accelerated so that they can explore larger domains within the same execution time. In this paper, we present a comparative study of parallel versions of the local search method for communication-aware task mapping. Unlike other comparative studies of heuristic methods implemented on GPUs, we compare the GPU parallel version with an MPI parallel version in terms of both execution time and the fitness values obtained. The MPI version was executed on a cluster optimized for MPI applications. For the GPU version, we considered a GPU with the Fermi architecture and mapped the local search algorithm onto the GPU in order to improve performance. The results show that the parallel implementation on a single GPU provides fitness function values similar to those of the MPI implementation on the cluster. However, the execution times required by the GPU implementation are significantly lower than those required by the MPI implementation, and the difference grows with the size of the parallel system.
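To make the idea of local search over task mappings concrete, the following is a minimal sequential sketch in Python. It is not the authors' GPU or MPI implementation, and the abstract does not specify their fitness function: the 2D-mesh topology, the cost model (traffic volume multiplied by Manhattan hop distance), the swap neighborhood, and all names (`mesh_distance`, `fitness`, `local_search`, `traffic`) are assumptions chosen purely for illustration.

```python
import random


def mesh_distance(a, b, width):
    """Manhattan hop count between nodes a and b on a 2D mesh (assumed topology)."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def fitness(mapping, traffic, width):
    """Assumed cost model: traffic volume times hop distance, summed over task pairs.

    mapping[t] is the processing element that task t is assigned to.
    Lower values are better.
    """
    return sum(
        volume * mesh_distance(mapping[src], mapping[dst], width)
        for (src, dst), volume in traffic.items()
    )


def local_search(traffic, n_tasks, width, iterations=2000, seed=0):
    """Swap-based local search: keep a swap only if it lowers the cost."""
    rng = random.Random(seed)
    mapping = list(range(n_tasks))
    rng.shuffle(mapping)  # random initial assignment, one task per element
    initial = best = fitness(mapping, traffic, width)
    for _ in range(iterations):
        i, j = rng.sample(range(n_tasks), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]
        cost = fitness(mapping, traffic, width)
        if cost < best:
            best = cost  # accept the improving swap
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
    return mapping, initial, best


# Hypothetical workload: 16 tasks communicating in a ring, mapped onto a 4x4 mesh.
traffic = {(t, (t + 1) % 16): 1 for t in range(16)}
mapping, initial, best = local_search(traffic, n_tasks=16, width=4)
print(f"initial cost = {initial}, cost after local search = {best}")
```

In a sketch like this, the full fitness evaluation of each candidate swap dominates the running time. That evaluation step is the natural candidate for parallelization, whether across GPU threads or MPI processes, although how the paper actually partitions the work is not described in the abstract.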



