大型光学显微镜图像拼接的CPU-GPU混合系统

2014 43rd International Conference on Parallel Processing Pub Date : 2014-11-20 DOI:10.1109/ICPP.2014.9

Timothy Blattner, Walid Keyrouz, J. Chalfoun, Bertrand Stivalet, M. Brady, Shujia Zhou

{"title":"大型光学显微镜图像拼接的CPU-GPU混合系统","authors":"Timothy Blattner, Walid Keyrouz, J. Chalfoun, Bertrand Stivalet, M. Brady, Shujia Zhou","doi":"10.1109/ICPP.2014.9","DOIUrl":null,"url":null,"abstract":"Researchers in various fields are using optical microscopy to acquire very large images, 10000 - 200000 of pixels per side. Optical microscopes acquire these images as grids of overlapping partial images (thousands of pixels per side) that are then stitched together via software. Composing such large images is a compute and data intensive task even for modern machines. Researchers compound this difficulty further by obtaining time-series, volumetric, or multiple channel images with the resulting data sets now having or approaching terabyte sizes. We present a scalable hybrid CPU-GPU implementation of image stitching that processes large image sets at near interactive rates. Our implementation scales well with both image sizes and the number of CPU cores and GPU cards in a machine. It processes a grid of 42 × 59 tiles into a 17 k × 22 k pixels image in 43 s (end-to-end execution times) when using one NVIDIA Tesla C2070 card and two Intel Xeon E-5620 quad-core CPUs, and in 29 s when using two Tesla C2070 cards and the same two CPUs. It also composes and renders the composite image without saving it in 15 s. In comparison, ImageJ/Fiji, which is widely used by biologists, has an image stitching plugin that takes > 3.6 h for the same workload despite being multithreaded and executing the same mathematical operators, it composes and saves the large image in an additional 1.5 h. This implementation takes advantage of coarse-grain parallelism. It organizes the computation into a pipeline architecture that spans CPU and GPU resources and overlaps computation with data motion. The implementation achieves a nearly 10× performance improvement over our optimized non-pipeline GPU implementation and demonstrates near-linear speedup when increasing CPU thread count and increasing number of GPUs.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"A Hybrid CPU-GPU System for Stitching Large Scale Optical Microscopy Images\",\"authors\":\"Timothy Blattner, Walid Keyrouz, J. Chalfoun, Bertrand Stivalet, M. Brady, Shujia Zhou\",\"doi\":\"10.1109/ICPP.2014.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Researchers in various fields are using optical microscopy to acquire very large images, 10000 - 200000 of pixels per side. Optical microscopes acquire these images as grids of overlapping partial images (thousands of pixels per side) that are then stitched together via software. Composing such large images is a compute and data intensive task even for modern machines. Researchers compound this difficulty further by obtaining time-series, volumetric, or multiple channel images with the resulting data sets now having or approaching terabyte sizes. We present a scalable hybrid CPU-GPU implementation of image stitching that processes large image sets at near interactive rates. Our implementation scales well with both image sizes and the number of CPU cores and GPU cards in a machine. It processes a grid of 42 × 59 tiles into a 17 k × 22 k pixels image in 43 s (end-to-end execution times) when using one NVIDIA Tesla C2070 card and two Intel Xeon E-5620 quad-core CPUs, and in 29 s when using two Tesla C2070 cards and the same two CPUs. It also composes and renders the composite image without saving it in 15 s. In comparison, ImageJ/Fiji, which is widely used by biologists, has an image stitching plugin that takes > 3.6 h for the same workload despite being multithreaded and executing the same mathematical operators, it composes and saves the large image in an additional 1.5 h. This implementation takes advantage of coarse-grain parallelism. It organizes the computation into a pipeline architecture that spans CPU and GPU resources and overlaps computation with data motion. The implementation achieves a nearly 10× performance improvement over our optimized non-pipeline GPU implementation and demonstrates near-linear speedup when increasing CPU thread count and increasing number of GPUs.\",\"PeriodicalId\":441115,\"journal\":{\"name\":\"2014 43rd International Conference on Parallel Processing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 43rd International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPP.2014.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 43rd International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2014.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 25

摘要

各个领域的研究人员都在使用光学显微镜来获取非常大的图像，每边10000 - 200000像素。光学显微镜将这些图像作为重叠部分图像(每边数千像素)的网格获取，然后通过软件将其拼接在一起。即使对于现代机器来说，合成如此大的图像也是一项计算和数据密集型任务。研究人员通过获得时间序列、体积或多通道图像来进一步增加这一困难，结果数据集现在已经或接近tb大小。我们提出了一种可扩展的混合CPU-GPU图像拼接实现，以接近交互速率处理大型图像集。我们的实现可以很好地扩展图像大小以及机器中CPU内核和GPU卡的数量。当使用一个NVIDIA Tesla C2070卡和两个Intel至强E-5620四核cpu时，它在43秒(端到端执行时间)内将一个42 × 59块的网格处理成一个17 k × 22 k像素的图像，而当使用两个Tesla C2070卡和相同的两个cpu时，则在29秒内。它还可以在15秒内合成和渲染合成图像而不保存它。相比之下，生物学家广泛使用的ImageJ/Fiji的图像拼接插件，尽管是多线程的，并且执行相同的数学运算符，但对于相同的工作负载，它需要> 3.6小时的时间来合成和保存大图像，它需要额外的1.5小时。这种实现利用了粗粒度并行性。它将计算组织到一个跨越CPU和GPU资源的管道架构中，并将计算与数据运动重叠。与我们优化的非流水线GPU实现相比，该实现实现了近10倍的性能提升，并且在增加CPU线程数和增加GPU数量时显示出接近线性的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Hybrid CPU-GPU System for Stitching Large Scale Optical Microscopy Images

Researchers in various fields are using optical microscopy to acquire very large images, 10000 - 200000 of pixels per side. Optical microscopes acquire these images as grids of overlapping partial images (thousands of pixels per side) that are then stitched together via software. Composing such large images is a compute and data intensive task even for modern machines. Researchers compound this difficulty further by obtaining time-series, volumetric, or multiple channel images with the resulting data sets now having or approaching terabyte sizes. We present a scalable hybrid CPU-GPU implementation of image stitching that processes large image sets at near interactive rates. Our implementation scales well with both image sizes and the number of CPU cores and GPU cards in a machine. It processes a grid of 42 × 59 tiles into a 17 k × 22 k pixels image in 43 s (end-to-end execution times) when using one NVIDIA Tesla C2070 card and two Intel Xeon E-5620 quad-core CPUs, and in 29 s when using two Tesla C2070 cards and the same two CPUs. It also composes and renders the composite image without saving it in 15 s. In comparison, ImageJ/Fiji, which is widely used by biologists, has an image stitching plugin that takes > 3.6 h for the same workload despite being multithreaded and executing the same mathematical operators, it composes and saves the large image in an additional 1.5 h. This implementation takes advantage of coarse-grain parallelism. It organizes the computation into a pipeline architecture that spans CPU and GPU resources and overlaps computation with data motion. The implementation achieves a nearly 10× performance improvement over our optimized non-pipeline GPU implementation and demonstrates near-linear speedup when increasing CPU thread count and increasing number of GPUs.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 43rd International Conference on Parallel Processing

自引率

0.00%

发文量