cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs
Authors: Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, M. Jiang
Published in: Proceedings of the 2018 International Conference on Supercomputing, 2018-06-12
DOI: 10.1145/3205289.3205309 (https://doi.org/10.1145/3205289.3205309)
Citations: 12
Abstract
Low-dose X-ray computed tomography (XCT) is a popular imaging technique for visualizing the internal structure of an object non-destructively. Model-based Iterative Reconstruction (MBIR) can reconstruct high-quality images, but at the cost of heavy computation. MBIR therefore often relies on platforms with hardware accelerators such as GPUs to speed up reconstruction. In MBIR, reconstruction minimizes an objective function by updating the image iteratively. The X-ray source emits a large number of X-rays from various views to cover as much of the object as possible, and the geometric relationships among these X-rays are complex and irregular. This inherent irregularity makes minimizing the objective function on GPUs very challenging. First, different implementations of the minimization affect convergence and GPU resource utilization differently. To this end, we explore different solvers for the minimization problem and different parallelism granularities for GPU kernel design. Second, the complex and irregular geometry of the X-rays leads to irregular memory behavior: two nearby X-rays may intersect and thus incur memory collisions, while two distant X-rays may incur non-coalesced memory accesses. We design a unified thread mapping algorithm that guides the mapping from X-rays to threads, jointly optimizing for memory collisions and non-coalesced memory accesses. Finally, we present a series of architecture-level optimizations to fully exploit the computing power of GPUs. Evaluation results demonstrate that cuMBIR can achieve a 1.48X speedup over the state-of-the-art GPU implementation.
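The two ideas in the abstract (iteratively minimizing an objective, and collisions between intersecting X-rays) can be illustrated with a toy sketch. This is not cuMBIR's solver or kernel design. It assumes a plain unregularized least-squares objective minimized by gradient descent, a hypothetical 4-voxel, 3-ray geometry, and NumPy on the CPU as a stand-in for GPU threads. The names `ray_voxels`, `ray_weights`, `forward_project`, `back_project`, `reconstruct`, and `scatter_demo` are all made up for illustration. Here `np.add.at` plays the role an atomic add would play on a GPU.

```python
import numpy as np

# Hypothetical toy geometry: 4 voxels, 3 rays. Each ray lists the voxels it
# crosses and the intersection lengths (the nonzeros of one system-matrix row).
# Rays 0 and 1 both cross voxel 1; rays 1 and 2 both cross voxel 2.
NUM_VOXELS = 4
ray_voxels = [np.array([0, 1]), np.array([1, 2]), np.array([2, 3])]
ray_weights = [np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([1.0, 1.0])]


def forward_project(image):
    """Compute one line integral per ray (A @ image)."""
    return np.array([w @ image[v] for v, w in zip(ray_voxels, ray_weights)])


def back_project(residual):
    """Scatter each ray's residual into the voxels it crosses (A.T @ residual).

    np.add.at accumulates correctly when several rays hit the same voxel,
    which is the situation that requires atomic updates on a GPU.
    """
    grad = np.zeros(NUM_VOXELS)
    for v, w, r in zip(ray_voxels, ray_weights, residual):
        np.add.at(grad, v, w * r)
    return grad


def reconstruct(measurements, step=0.2, iterations=500):
    """Gradient descent on 0.5 * ||A x - y||^2 (assumed objective, no prior)."""
    image = np.zeros(NUM_VOXELS)
    for _ in range(iterations):
        residual = forward_project(image) - measurements
        image -= step * back_project(residual)
    return image


def scatter_demo():
    """Show how colliding updates are lost without accumulation semantics."""
    idx = np.concatenate(ray_voxels)   # [0, 1, 1, 2, 2, 3]
    vals = np.ones(idx.size)

    racy = np.zeros(NUM_VOXELS)
    racy[idx] += vals                  # duplicate indices: one update survives

    safe = np.zeros(NUM_VOXELS)
    np.add.at(safe, idx, vals)         # every update is accumulated
    return racy, safe
```

In `scatter_demo`, the buffered update drops one contribution for each shared voxel, while the accumulating version keeps both. On a GPU the same conflict is resolved with atomic operations, which serialize colliding threads. That is why the mapping from X-rays to threads affects performance.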