Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster

Proceedings of the 11th Workshop on General Purpose GPUs. Pub Date: 2018-02-24. DOI: 10.1145/3180270.3180273

T. Baba, Shinpei Watanabe, B. Jackin, Takeshi Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai


Citations: 3

Abstract

The 3D holographic display has long been anticipated as a future human interface because it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study suggests that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment so as to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing the object decomposition scheme, reducing data transfer between the CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. The experimental results show that the intra-node optimizations achieve an 11.52-fold speed-up over the original single-node code. Furthermore, the multi-node optimizations, using 8 nodes with 2 GPUs per node, achieve an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object. This corresponds to a 237.92-fold speed-up over sequential CPU processing using a conventional FFT-based algorithm.
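The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration, and it assumes that the 237.92-fold speed-up is measured against the 4.28 s multi-node run, which is how the abstract reads. Under that assumption the sequential CPU baseline would have taken roughly 1018 s (about 17 minutes). The function and constant names are hypothetical and do not come from the paper.

```python
# Figures quoted in the abstract.
NODES = 8
GPUS_PER_NODE = 2
CLUSTER_TIME_S = 4.28        # execution time on the 8-node cluster
HOLOGRAM_PIXELS = 1.6e9      # 1.6-gigapixel hologram
SPEEDUP_VS_CPU = 237.92      # speed-up over sequential CPU processing


def implied_baseline_time_s(accelerated_time_s: float, speedup: float) -> float:
    """Baseline time implied by a speed-up factor (assumption: the
    speed-up is relative to the accelerated time passed in)."""
    return accelerated_time_s * speedup


def throughput_pixels_per_s(pixels: float, seconds: float) -> float:
    """Hologram pixels generated per second of wall-clock time."""
    return pixels / seconds


total_gpus = NODES * GPUS_PER_NODE
cpu_time_s = implied_baseline_time_s(CLUSTER_TIME_S, SPEEDUP_VS_CPU)
cluster_rate = throughput_pixels_per_s(HOLOGRAM_PIXELS, CLUSTER_TIME_S)

print(f"GPUs in the cluster:        {total_gpus}")
print(f"Implied CPU baseline time:  {cpu_time_s:.1f} s ({cpu_time_s / 60:.1f} min)")
print(f"Cluster throughput:         {cluster_rate / 1e9:.3f} gigapixels/s")
print(f"Average per-GPU throughput: {cluster_rate / total_gpus / 1e6:.1f} megapixels/s")
```

The per-GPU figure is a plain average over the 16 GPUs and says nothing about how evenly the work is actually distributed.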
