Performance portable back-projection algorithms on CPUs: agnostic data locality and vectorization optimizations

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing Pub Date : 2021-04-27 DOI:10.1145/3447818.3460353

Peng Chen, M. Wahib, Xiao Wang, Shin'ichiro Takizawa, Takahiro Hirofuchi, Hirotaka Ogawa, S. Matsuoka


Citations: 0

Abstract

Computed Tomography (CT) is a key 3D imaging technology that fundamentally relies on the compute-intensive back-projection operation to generate 3D volumes. GPUs are typically used for back-projection in production CT devices. However, with the rise of power-constrained micro-CT devices and the emergence of CPUs comparable in performance to GPUs, back-projection on CPUs could become favorable. Unlike on GPUs, extracting parallelism for back-projection algorithms on CPUs is complex because parallelism and locality are not explicitly defined and controlled by the programmer, as they are when using CUDA, for instance. We propose a collection of novel back-projection algorithms that reduce arithmetic computation, robustly enable vectorization, enforce a regular memory access pattern, and maximize data locality. We also implement the novel algorithms as efficient back-projection kernels that are performance portable across a wide range of CPUs. Performance evaluation on a variety of CPUs from different vendors and generations demonstrates that our back-projection implementation achieves an average speedup of 5.2x over the multi-threaded implementation of the most widely used, and optimized, open library. With a state-of-the-art CPU, we reach performance that rivals top-performing GPUs.
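
For readers unfamiliar with the operation, the following is a minimal sketch of unfiltered back-projection. It is not the authors' kernels: it assumes a simple 2D parallel-beam geometry with detector spacing equal to pixel spacing, and the function name `back_project` and its parameters are hypothetical. It only illustrates two ideas the abstract names: the inner loop walks contiguous pixels of one image row (a regular memory access pattern), and the detector coordinate is updated by a single addition per pixel instead of being recomputed (reduced arithmetic).

```python
import math


def back_project(sinogram, angles, size):
    """Unfiltered 2D parallel-beam back-projection onto a size x size image.

    sinogram[a][d] is the reading at angle index a and detector bin d.
    Assumes at least two detector bins and unit detector/pixel spacing.
    """
    n_det = len(sinogram[0])
    center = (size - 1) / 2.0        # image center in pixel units
    det_center = (n_det - 1) / 2.0   # detector center in bin units
    image = [[0.0] * size for _ in range(size)]

    for row, theta in zip(sinogram, angles):
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for y in range(size):
            out = image[y]  # contiguous output row
            # Detector coordinate of the first pixel (x = 0) in this row.
            t = (0 - center) * cos_t + (y - center) * sin_t + det_center
            for x in range(size):
                if 0.0 <= t <= n_det - 1:
                    # Linear interpolation between neighboring detector bins.
                    i = min(int(t), n_det - 2)
                    w = t - i
                    out[x] += (1.0 - w) * row[i] + w * row[i + 1]
                # Moving one pixel in x shifts the coordinate by cos_t.
                t += cos_t
    return image
```

With a single projection at angle 0, each image row is simply a copy of the detector row, which is a quick sanity check of the geometry. A production kernel would additionally apply filtering, handle cone-beam geometry, and use SIMD and multi-threading, none of which this sketch attempts.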
