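X-Ray Computed Tomography Applied to Objects of Cultural Heritage: Porting and Testing the Filtered Back-Projection Reconstruction Algorithm on Low Power Systems-on-Chip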

2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). Pub Date: 2016-04-04. DOI: 10.1109/PDP.2016.60

Elena Corni, L. Morganti, M. Morigi, R. Brancaccio, M. Bettuzzi, G. Levi, E. Peccenini, D. Cesini, A. Ferraro


Citations: 8

Abstract



X-Ray Computed Tomography Applied to Objects of Cultural Heritage: Porting and Testing the Filtered Back-Projection Reconstruction Algorithm on Low Power Systems-on-Chip

The embedded and high-performance computing (HPC) sectors, which in the past were completely separate, are now converging under the pressure of two driving forces: the release of less power-hungry server processors and the increased performance of the new low-power Systems-on-Chip (SoCs) developed to meet the requirements of the demanding mobile market. This convergence allows applications that were originally confined to traditional HPC systems to be ported to low-power embedded architectures. In this paper, we present our experience of porting the Filtered Back-Projection algorithm, which is heavily used in 3D tomography reconstruction software, to a low-power, low-cost system-on-chip, the NVIDIA Tegra K1, which is based on a quad-core ARM CPU and an NVIDIA Kepler GPU. The porting was done using several programming models (OpenMP and CUDA), and multiple versions of the application were developed to exploit both the SoC CPU and GPU. Performance was measured in terms of 2D slices (of a 3D volume) reconstructed per unit of time and per unit of energy. The results obtained with all the developed versions are reported and compared with those obtained on a typical x86 HPC node accelerated with a recent NVIDIA GPU. The best performance is achieved by combining the OpenMP and CUDA versions of the algorithm. In particular, we found that just three Jetson TK1 boards, connected via Gigabit Ethernet, can reconstruct as many images per unit of time as a traditional server while using one order of magnitude less energy. The results of this work can be applied, for instance, to building an energy-efficient computing system for a portable tomographic apparatus.
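For readers unfamiliar with the algorithm named in the abstract, the following is a minimal sketch of filtered back-projection for a single 2D slice. It is not the authors' code. It assumes a parallel-beam geometry (the abstract does not state the acquisition geometry), a spatial-domain Ram-Lak ramp filter, and linear interpolation between detector samples. All function names and the disk phantom are hypothetical and chosen only for illustration.

```python
import math

def ramp_kernel(n, spacing=1.0):
    """Spatial-domain Ram-Lak (ramp) filter tap at integer offset n."""
    if n == 0:
        return 1.0 / (4.0 * spacing ** 2)
    if n % 2 == 0:
        return 0.0
    return -1.0 / (math.pi * n * spacing) ** 2

def filter_projection(proj, spacing=1.0):
    """Convolve one projection (a list of detector readings) with the ramp filter."""
    size = len(proj)
    return [
        spacing * sum(proj[k] * ramp_kernel(i - k, spacing) for k in range(size))
        for i in range(size)
    ]

def reconstruct_slice(sinogram, angles, size, spacing=1.0):
    """Filtered back-projection of one 2D slice (parallel-beam assumption).

    sinogram[a][d] is the reading of detector d at angle angles[a] (radians).
    Returns a size x size image as a list of rows.
    """
    n_det = len(sinogram[0])
    det_center = (n_det - 1) / 2.0
    img_center = (size - 1) / 2.0
    filtered = [filter_projection(p, spacing) for p in sinogram]
    image = [[0.0] * size for _ in range(size)]
    scale = math.pi / len(angles)
    for q, theta in zip(filtered, angles):
        c, s = math.cos(theta), math.sin(theta)
        # Each pixel update is independent of the others, which is what makes
        # this loop nest a natural target for OpenMP threads or CUDA kernels.
        for row in range(size):
            y = (row - img_center) * spacing
            for col in range(size):
                x = (col - img_center) * spacing
                # Detector position (in samples) of the ray through (x, y).
                t = (x * c + y * s) / spacing + det_center
                i0 = math.floor(t)
                if 0 <= i0 < n_det - 1:
                    w = t - i0
                    image[row][col] += scale * ((1 - w) * q[i0] + w * q[i0 + 1])
    return image

def disk_sinogram(radius, n_det, n_angles):
    """Analytic sinogram of a centered unit-density disk (hypothetical phantom)."""
    center = (n_det - 1) / 2.0
    proj = [
        2.0 * math.sqrt(max(radius ** 2 - (d - center) ** 2, 0.0))
        for d in range(n_det)
    ]
    angles = [k * math.pi / n_angles for k in range(n_angles)]
    return [proj[:] for _ in angles], angles

sino, angles = disk_sinogram(radius=10.0, n_det=65, n_angles=60)
recon = reconstruct_slice(sino, angles, size=65)
# Pixels inside the disk should come out close to 1, pixels outside close to 0.
print(round(recon[32][32], 2), round(recon[32][52], 2))
```

Because every slice of the 3D volume can be reconstructed independently in this scheme, counting slices per second and slices per joule, as the abstract does, is a natural way to compare platforms. This pure-Python version favors readability over speed and says nothing about how the ported OpenMP and CUDA versions are structured.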

