Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster

Proceedings of the 11th Workshop on General Purpose GPUs. Pub Date: 2018-02-24. DOI: 10.1145/3180270.3180273

T. Baba, Shinpei Watanabe, B. Jackin, Takeshi Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai


Citations: 3

Abstract

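The 3D holographic display has long been anticipated as a future human interface because it does not require users to wear special devices. However, its heavy computational requirements have prevented such displays from being realized. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment so as to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing how the object is decomposed, reducing data transfer between CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. Experimental results show that the intra-node optimizations achieve an 11.52-fold speed-up over the original single-node code. Furthermore, the multi-node optimizations, using 8 nodes with 2 GPUs per node, achieve an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object. This corresponds to a 237.92-fold speed-up over sequential CPU processing using a conventional FFT-based algorithm.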
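To put the reported figures in perspective, the following is a minimal arithmetic sketch in Python. It uses only the numbers quoted in the abstract (237.92-fold speed-up, 4.28 s, 1.6-gigapixel hologram). The function names and the 30 frames-per-second target are illustrative assumptions, not values from the paper, and the implied CPU time assumes the sequential baseline was measured on the same problem size.

```python
# Figures quoted in the abstract.
TOTAL_SPEEDUP = 237.92     # multi-node GPU code vs. sequential CPU code
CLUSTER_TIME_S = 4.28      # execution time on 8 nodes x 2 GPUs
HOLOGRAM_GPIXELS = 1.6     # size of the generated hologram

# Assumption for illustration only: the abstract says "real time"
# but does not state a frame rate.
ASSUMED_TARGET_FPS = 30


def implied_baseline_time(speedup, accelerated_time_s):
    """Baseline time implied by a speed-up factor and the accelerated time."""
    return speedup * accelerated_time_s


def throughput_gpixels_per_s(gpixels, time_s):
    """Hologram pixels produced per second, in gigapixels."""
    return gpixels / time_s


def realtime_gap(time_per_frame_s, target_fps):
    """Factor by which generation must still speed up to reach target_fps."""
    return time_per_frame_s * target_fps


cpu_time_s = implied_baseline_time(TOTAL_SPEEDUP, CLUSTER_TIME_S)
rate = throughput_gpixels_per_s(HOLOGRAM_GPIXELS, CLUSTER_TIME_S)
gap = realtime_gap(CLUSTER_TIME_S, ASSUMED_TARGET_FPS)

print(f"implied sequential CPU time: {cpu_time_s:.1f} s")
print(f"cluster throughput: {rate:.3f} gigapixels/s")
print(f"remaining gap to {ASSUMED_TARGET_FPS} fps: {gap:.1f}x")
```

Under these assumptions the sequential CPU run would take roughly 17 minutes, and one hologram every 4.28 s is still about two orders of magnitude short of a 30 fps display rate, which is consistent with the abstract's framing of real-time processing as a requirement rather than an achieved result.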



