2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

CAGC: A Content-aware Garbage Collection Scheme for Ultra-Low Latency Flash-based SSDs
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00025
Suzhen Wu, Chunfeng Du, Haijun Li, Hong Jiang, Zhirong Shen, Bo Mao
With the advent of new flash-based memory technologies with ultra-low latency, directly applying inline data deduplication in flash-based storage devices can degrade system performance, since key deduplication operations lie on the shortened critical write path of such devices. To address this problem, we propose a Content-Aware Garbage Collection scheme (CAGC), which embeds data deduplication into the data-movement workflow of the garbage collection (GC) process in ultra-low latency flash-based SSDs. By parallelizing valid-page migration, hash computation, and flash-block erasure, CAGC alleviates the deduplication-induced performance overhead and eliminates redundant page writes during GC. To further reduce data writes and write amplification during GC, CAGC separates and stores data pages in different regions based on their reference counts. A performance evaluation of our CAGC prototype implemented in FlashSim shows that CAGC significantly reduces the number of flash blocks erased and data pages migrated during GC, improving both the user I/O performance and the reliability of ultra-low latency flash-based SSDs.
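The GC-time deduplication and reference-count-based page separation described in the abstract can be sketched in a few lines. This is an illustrative model only, not the paper's implementation: the function name, the fingerprint store, and the region sets are all hypothetical, and a real FTL would track physical page numbers and erase blocks rather than Python objects.

```python
import hashlib

def gc_migrate(valid_pages, fingerprints, low_ref_region, high_ref_region, threshold=2):
    """Sketch of CAGC-style garbage collection: during migration, valid pages
    are fingerprinted by content hash; duplicates only bump a reference count
    instead of being rewritten, and surviving pages are placed into separate
    regions by reference count to cut future write amplification."""
    pages_written = 0
    for page in valid_pages:
        fp = hashlib.sha1(page).hexdigest()
        if fp in fingerprints:
            fingerprints[fp] += 1          # redundant write eliminated
        else:
            fingerprints[fp] = 1
            pages_written += 1             # only unique content hits flash
        # separate pages by reference count: highly shared pages are grouped
        # together because they are invalidated far less often
        if fingerprints[fp] >= threshold:
            low_ref_region.discard(fp)
            high_ref_region.add(fp)
        else:
            low_ref_region.add(fp)
    return pages_written
```

With four incoming pages of which three share the same content, only two pages are physically written and the shared page migrates to the high-reference region.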
Citations: 3
Nowa: A Wait-Free Continuation-Stealing Concurrency Platform
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00044
Florian Schmaus, Nicolas Pfeiffer, Wolfgang Schröder-Preikschat, Timo Hönig, J. Nolte
It is an ongoing challenge to use the parallelism of today's multi- and many-core processors efficiently. Scalability becomes more crucial than ever with the rapidly growing number of processing elements in many-core systems operating in data centres and embedded domains. Scalability is often ensured by using fully-strict fork/join concurrency, the prevalent approach of concurrency platforms like Cilk. The runtime systems employed by those platforms typically resort to lock-based synchronisation due to the complex interactions of data structures within the runtime. However, locking limits scalability severely. With the availability of commercial off-the-shelf systems with hundreds of logical cores, this is becoming a problem for an increasing number of systems. This paper presents Nowa, a novel wait-free approach to arbitrating the plentiful concurrent strands managed by a concurrency platform's runtime system. The wait-free approach is enabled by exploiting inherent properties of fully-strict fork/join concurrency, and hence is potentially applicable to every continuation-stealing runtime system of a concurrency platform. We have implemented Nowa and compared it with existing runtime systems, including Cilk Plus and Threading Building Blocks (TBB), which employ a lock-based approach. Our evaluation shows that the wait-free implementation increases performance by up to 1.64× over lock-based implementations on a system with 256 hardware threads. Performance increased by 1.17× on average, and only one benchmark exhibited a performance regression. Compared against OpenMP tasks using Clang's libomp, Nowa outperforms OpenMP by 8.68× on average.
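The heart of a non-blocking join in a continuation-stealing runtime is a counter that every finishing strand decrements; whichever strand performs the last decrement runs the continuation, so no strand ever waits. A minimal Python sketch of that protocol follows. Python has no atomic fetch-and-decrement, so a lock stands in for it here (which is exactly the lock-based compromise Nowa's wait-free design removes); the class name and API are invented for illustration.

```python
import threading

class JoinCounter:
    """Join-counter protocol of a continuation-stealing fork/join runtime
    (conceptual sketch). Each forked child calls child_done() when it
    finishes; the strand that performs the final decrement executes the
    parent's continuation instead of anyone blocking on a join."""

    def __init__(self, n_children, continuation):
        self._count = n_children
        self._continuation = continuation
        self._lock = threading.Lock()    # stand-in for an atomic fetch-and-decrement

    def child_done(self):
        with self._lock:
            self._count -= 1
            last = self._count == 0
        if last:
            self._continuation()         # the last finisher "steals" the continuation
```

Run three child strands and the continuation fires exactly once, from whichever thread happened to finish last.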
Citations: 4
Performance-Portable Graph Coarsening for Efficient Multilevel Graph Analysis
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00030
Michael S. Gilbert, Seher Acer, E. Boman, Kamesh Madduri, S. Rajamanickam
The multilevel heuristic is an effective strategy for speeding up graph analytics, and graph coarsening is an integral step of multilevel methods. We perform a comprehensive study of multilevel coarsening in this work. We primarily focus on the graphics processing unit (GPU) parallelization of the Heavy Edge Coarsening (HEC) method executed in an iterative setting. We present optimizations for the two phases of coarsening: a fine-to-coarse vertex mapping phase and a coarse graph construction phase. We also express several other coarsening algorithms using the Kokkos framework and discuss their parallelization. We demonstrate the efficacy of parallelized HEC on an NVIDIA Turing GPU and a 32-core AMD Ryzen processor, using multilevel spectral graph partitioning as the primary case study.
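To make the two coarsening phases concrete, here is a sequential sketch of one coarsening level: a fine-to-coarse vertex map built greedily from heaviest edges (classic heavy-edge matching, a close sequential relative of the HEC mapping the paper parallelizes), followed by coarse-graph construction by aggregating edge weights. The function names and the dict-of-dicts graph format are illustrative, not the paper's Kokkos code.

```python
def heavy_edge_matching(adj):
    """Phase 1, fine-to-coarse vertex mapping: greedily pair each unmatched
    vertex with its heaviest still-unmatched neighbour.
    `adj` maps vertex -> {neighbour: edge_weight}."""
    coarse_of, next_id = {}, 0
    for v in adj:
        if v in coarse_of:
            continue
        candidates = [(w, u) for u, w in adj[v].items() if u not in coarse_of]
        if candidates:
            _, mate = max(candidates)            # heaviest incident edge wins
            coarse_of[v] = coarse_of[mate] = next_id
        else:
            coarse_of[v] = next_id               # no free neighbour: singleton
        next_id += 1
    return coarse_of

def build_coarse_graph(adj, coarse_of):
    """Phase 2, coarse-graph construction: sum fine edge weights between
    coarse vertices and drop edges that collapsed into self-loops."""
    coarse = {}
    for v, nbrs in adj.items():
        for u, w in nbrs.items():
            cv, cu = coarse_of[v], coarse_of[u]
            if cv != cu:
                row = coarse.setdefault(cv, {})
                row[cu] = row.get(cu, 0) + w
    return coarse
```

On a path graph a-b-c with edge weights 3 and 1, the heavy edge (a, b) collapses into one coarse vertex and c remains a singleton.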
Citations: 5
AlphaR: Learning-Powered Resource Management for Irregular, Dynamic Microservice Graph
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00089
Xiaofeng Hou, Chao Li, Jiacheng Liu, Lu Zhang, Shaolei Ren, Jingwen Leng, Quan Chen, M. Guo
The microservice architecture is a growing trend that transforms the traditional monolithic application into massive numbers of dynamic, irregular small services. To boost overall throughput and meet latency guarantees, it is desirable to process massive service requests in parallel with efficient resource sharing in data centers. However, the disaggregated nature of microservices unavoidably enlarges the design space of resource management and increases its complexity. In this paper, we propose AlphaR, a learning-powered resource management system tailored to the microservice environment. The basic idea of AlphaR is to generate microservice-specific resource management policies for improving efficiency. Specifically, we take the first step of using a bipartite graph as a convenient abstraction for applications built with microservices. Based on this, we devise a bipartite feature inference approach named Bi-GNN to extract the temporal characteristics of microservices. Furthermore, we implement a policy network to select appropriate resource allocation choices for maximizing performance in resource-constrained data centers. AlphaR can improve the mean and p95 response time by up to 80% and 77.5%, respectively, compared with conventional schemes.
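The bipartite-graph abstraction can be illustrated with a single hand-written aggregation step: features flow from one node set (say, resource or endpoint nodes) to the other (service nodes) along bipartite edges. This is only the shape of computation a Bi-GNN layer performs; the real layer is learned, while this sketch simply averages, and all names are invented.

```python
def bipartite_aggregate(edges, feats_u, feats_v):
    """One unweighted message-passing step over a bipartite graph:
    each node in set U receives the mean feature of its neighbours in set V.
    `edges` is a list of (u, v) pairs; nodes with no neighbours keep
    their own feature."""
    out = {}
    for u in feats_u:
        nbr = [feats_v[v] for (uu, v) in edges if uu == u]
        out[u] = sum(nbr) / len(nbr) if nbr else feats_u[u]
    return out
```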
Citations: 20
An In-Depth Analysis of Distributed Training of Deep Neural Networks
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00108
Yunyong Ko, Kibong Choi, Jiwon Seo, Sang-Wook Kim
As the popularity of deep learning in industry rapidly grows, efficient training of deep neural networks (DNNs) becomes important. To train a DNN on a large amount of data, distributed training with data parallelism has been widely adopted. However, communication overhead limits the scalability of distributed training, and a number of distributed training algorithms have been proposed to reduce it. The model accuracy and training performance of these algorithms differ depending on factors such as cluster settings, training models/datasets, and the optimization techniques applied. To adopt the distributed training algorithm appropriate for a given situation, one must fully understand the model accuracy and training performance of these algorithms across various settings. Toward this end, this paper reviews and evaluates seven popular distributed training algorithms (BSP, ASP, SSP, EASGD, AR-SGD, GoSGD, and AD-PSGD) in terms of model accuracy and training performance in various settings. Specifically, we evaluate these algorithms on two CNN models, in different cluster settings, and with three well-known optimization techniques. Through extensive evaluation and analysis, we made several interesting discoveries. For example, we found that some distributed training algorithms (SSP, EASGD, and GoSGD) have a highly negative impact on model accuracy because they adopt intermittent and asymmetric communication to improve training performance, and that the communication overhead of some centralized algorithms (ASP and SSP) is much higher than expected in a cluster setting with limited network bandwidth because of the parameter-server (PS) bottleneck. These findings, and many more in the paper, can guide the adoption of proper distributed training algorithms in industry; they can also be useful in academia for designing new distributed training algorithms.
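Of the seven algorithms evaluated, SSP illustrates the accuracy/performance trade-off most directly: a worker may run ahead of the slowest worker only up to a staleness bound. One common formulation of that admission rule is sketched below (the function name and exact formulation are ours, not the paper's); with a staleness of 0 it degenerates to BSP-style lockstep, and with an unbounded staleness it behaves like ASP.

```python
def ssp_can_proceed(worker_clocks, worker, staleness):
    """Stale Synchronous Parallel admission rule: a worker may start its
    next iteration only if it is within `staleness` iterations of the
    slowest worker; otherwise it must wait for stragglers to catch up."""
    return worker_clocks[worker] - min(worker_clocks.values()) <= staleness
```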
Citations: 9
Jigsaw: A Slice-and-Dice Approach to Non-uniform FFT Acceleration for MRI Image Reconstruction
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00081
Brendan L. West, J. Fessler, T. Wenisch
The Fast Fourier Transform (FFT) is a fundamental algorithm in signal processing; significant efforts have been made to improve its performance using software optimizations and specialized hardware accelerators. Computational imaging modalities, such as MRI, often rely on the Non-uniform Fast Fourier Transform (NuFFT), a variant of the FFT for processing data acquired from non-uniform sampling patterns. The most time-consuming step of the NuFFT algorithm is "gridding", wherein non-uniform samples are interpolated to allow a uniform FFT to be computed over the data. Each non-uniform sample affects a window of non-contiguous memory locations, resulting in poor cache and memory bandwidth utilization. As a result, gridding can account for more than 99.6% of the NuFFT computation time, while the FFT requires less than 0.4%. We present Slice-and-Dice, a novel approach to the NuFFT's gridding step that eliminates the presorting operations required by prior methods and maps more efficiently to hardware. Our GPU implementation achieves gridding speedups of over 250× and 16× over prior state-of-the-art CPU and GPU implementations, respectively. We achieve further speedup and energy-efficiency gains by implementing Slice-and-Dice in hardware with JIGSAW, a streaming hardware accelerator for non-uniform data gridding. JIGSAW uses stall-free fixed-point pipelines to process M non-uniform samples in approximately M cycles, irrespective of sampling pattern, yielding speedups of over 1500× the CPU baseline and 36× the state-of-the-art GPU implementation, while consuming ~200 mW of power and ~12 mm² of area in 16 nm technology. The Slice-and-Dice GPU and JIGSAW ASIC implementations achieve unprecedented end-to-end NuFFT speedups of 8× and 36×, respectively, compared with the state-of-the-art GPU implementation.
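The naive scatter formulation of gridding whose poor locality the abstract describes can be written in a few lines. This sketch uses a triangle interpolation kernel on a tiny periodic 1-D grid purely for clarity; production NuFFT codes use Kaiser-Bessel kernels over 2-D/3-D grids, and the function name is invented. The scattered, non-contiguous `grid[...] +=` writes inside the inner loop are exactly the memory-bandwidth problem Slice-and-Dice reorganizes away.

```python
def grid_samples(samples, grid_size, half_width=1):
    """Naive NuFFT gridding: each non-uniform sample (pos, value) scatters
    its value onto a small window of uniform grid cells, weighted by an
    interpolation kernel (triangle kernel here), with periodic wrap-around."""
    grid = [0.0] * grid_size
    for pos, value in samples:                    # pos is a real-valued coordinate
        centre = int(round(pos))
        for cell in range(centre - half_width, centre + half_width + 1):
            dist = abs(pos - cell)
            weight = max(0.0, 1.0 - dist / (half_width + 1))
            grid[cell % grid_size] += weight * value
    return grid
```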
Citations: 0
High-Level Synthesis of Parallel Specifications Coupling Static and Dynamic Controllers
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00028
Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi
Conventional High-Level Synthesis (HLS) tools exploit parallelism mostly at the instruction level (ILP). They statically schedule the input specifications and build centralized Finite State Machine (FSM) controllers. However, aggressive exploitation of ILP has diminishing returns in many applications, and centralized approaches usually do not exploit coarser parallelism efficiently, because FSMs are inherently serial. In this paper we present an HLS framework able to synthesize applications that, besides ILP, also expose Task-Level Parallelism (TLP). An application can expose TLP through annotations that identify the parallel functions (i.e., tasks). To generate accelerators that efficiently execute concurrent tasks, we need to solve several issues: devise a mechanism to support concurrent execution flows, exploit memory parallelism, and manage synchronization. To support concurrent execution flows, we introduce a novel adaptive controller, composed of a set of interacting control elements that independently manage the execution of a task. These control elements check dependencies and resource constraints at runtime, enabling execution as soon as possible. To support parallel access to shared memories and synchronization, we integrate a novel Hierarchical Memory Interface (HMI). With respect to previous solutions, the proposed interface supports multi-ported memories and atomic memory operations, which commonly occur in parallel programming. Our framework can generate the hardware implementation of C functions by employing two different approaches, depending on their characteristics. If a function exposes TLP, the framework generates a hardware implementation based on the adaptive controller; otherwise, it implements the function through the FSM approach, which is optimized for ILP exploitation. We evaluate our framework on a set of parallel applications and show substantial performance improvements (average speedup of 4.7×) with limited area overheads (average area increase of 5.48×).
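The behaviour of the adaptive controller can be modelled as a step-driven simulation in which each task's control element starts the task as soon as its dependencies are done and a resource unit is free, in contrast to a static FSM schedule fixed at synthesis time. This is a software model of the controller's scheduling decisions, not generated hardware; it assumes unit-time tasks and an acyclic dependence graph, and all names are illustrative.

```python
def adaptive_schedule(tasks, deps, n_units):
    """Step-driven model of runtime dependency checking: at each step, every
    not-yet-done task whose dependency set is satisfied is ready, and up to
    n_units of them fire (the resource constraint). Returns the list of
    tasks fired at each step. Assumes `deps` is acyclic."""
    done, schedule = set(), []
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t not in done and deps.get(t, set()) <= done]
        fired = ready[:n_units]          # resource constraint per step
        schedule.append(fired)
        done.update(fired)
    return schedule
```

With two units, independent tasks a and b fire together, then c and d follow once their dependencies are satisfied.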
Citations: 8
Efficient Video Captioning on Heterogeneous System Architectures
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00112
Horng-Ruey Huang, Ding-Yong Hong, Jan-Jan Wu, Pangfeng Liu, W. Hsu
Video captioning is the core technology driving many important multidisciplinary applications, such as AI-assisted medical diagnosis, storytelling through videos, video question answering, and lip-reading, to name a few. Video captioning employs a hybrid CNN+RNN neural network model to translate video scenes into natural language descriptions. For deep learning inference, a typical approach is to run both the CNN and the RNN on a GPU. Such a GPU-only approach often suffers long inference times due to underutilization of the computing power offered by the CPU+GPU heterogeneous system architecture, which is common in modern computers. This work is an early effort to tackle the performance of deep learning inference with a hybrid CNN+RNN model on a heterogeneous system with a CPU and a GPU. This is challenging because (1) the CNN and the RNN exhibit very different computing behaviors, which raises the question of how to split the two models into computing tasks and assign those tasks to the CPU and the GPU so as to minimize the inference time for a video frame, and (2) data dependencies exist between the CNN and the RNN within a video frame, as well as between the RNNs of adjacent video frames; these dependencies prohibit full parallelization of the hybrid model. To solve these two problems, we propose two optimizations: a fine-grained scheduling scheme for mapping computation to devices within a video frame, and a pipeline scheduling scheme to exploit maximum parallelism across the execution of video frames. To facilitate our optimizations, we also develop an accurate regression-based cost model to predict the computation time of CNN/RNN operations and the communication time for moving data between the CPU and the GPU. Experimental results show that our optimizations improve the performance of video captioning by up to 3.24× on the CPU+GPU system, compared with GPU-only execution.
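The pipeline scheduling idea can be captured in a small recurrence: the GPU processes CNN stages frame after frame, while frame f's RNN starts only when both its own CNN output and frame f-1's RNN are finished, reflecting the two data dependencies noted in the abstract. The sketch below uses made-up uniform stage times and ignores CPU-GPU transfer costs, so it is a model of the scheduling shape, not of the paper's cost-model-driven scheduler.

```python
def pipelined_makespan(n_frames, cnn_time, rnn_time):
    """Two-stage pipeline over frames: the GPU runs CNNs back-to-back, and
    frame f's RNN starts at max(CNN_f done, RNN_{f-1} done)."""
    cnn_done, rnn_done = 0.0, 0.0
    for _ in range(n_frames):
        cnn_done = cnn_done + cnn_time               # next CNN on the GPU
        rnn_done = max(cnn_done, rnn_done) + rnn_time  # RNN waits on both deps
    return rnn_done

def sequential_makespan(n_frames, cnn_time, rnn_time):
    """GPU-only baseline: CNN then RNN per frame, no cross-frame overlap."""
    return n_frames * (cnn_time + rnn_time)
```

For 3 frames with a 2-unit CNN and 1-unit RNN, the pipeline finishes in 7 time units against 9 for the sequential baseline, and the gap widens with frame count.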
Citations: 3
High Performance Streaming Tensor Decomposition
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00078
Yongseok Soh, P. Flick, Xing Liu, Shaden Smith, Fabio Checconi, F. Petrini, Jee W. Choi
We present a new algorithm for computing tensor decompositions on streaming data that achieves up to 102× speedup over the state-of-the-art CP-stream algorithm through lower computational complexity and performance optimization. For each streaming time slice, our algorithm partitions the factor matrix rows into those with and without updates and keeps them in Gram-matrix form to significantly reduce the required computation. We also improve the scalability and performance of the matricized tensor times Khatri-Rao product (MTTKRP) kernel, a key performance bottleneck in many tensor decomposition algorithms, by reducing the synchronization overhead through the combined use of mutex locks and thread-local memory. For problems with constraints (e.g., non-negativity), we apply data blocking and operation fusion to the alternating direction method of multipliers (ADMM) kernel in the constrained CP-stream algorithm. By combining this ADMM optimization with the aforementioned MTTKRP optimization, our improved algorithm achieves up to 47× speedup over the original. We evaluate the performance and scalability of our new algorithm and optimization techniques on four representative real-world tensors, using a 56-core quad-socket Intel Xeon system.
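The Gram-matrix form that keeps per-slice work bounded can be shown in a few lines: each new time slice folds its factor rows A_t into an accumulated Gram matrix via G += A_tᵀ A_t, so the cost per slice depends only on the slice, never on the full history. This pure-Python sketch (names and list-of-lists layout are ours, for illustration) shows the rank-r update at the heart of that idea.

```python
def update_gram(gram, rows):
    """Fold a time slice's factor rows into the running Gram matrix:
    G += A_t^T A_t, where each element of `rows` is one length-r factor row.
    `gram` is an r-by-r list of lists, updated in place and returned."""
    r = len(gram)
    for row in rows:
        for i in range(r):
            for j in range(r):
                gram[i][j] += row[i] * row[j]
    return gram
```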
Citations: 4
A Multi-GPU Design for Large Size Cryo-EM 3D Reconstruction
2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2021-05-01 DOI: 10.1109/IPDPS49936.2021.00094
Zihao Wang, Xiaohua Wan, Zhiyong Liu, Qianshuo Fan, Fa Zhang, Guangming Tan
Three-dimensional (3D) reconstruction in cryo-electron microscopy (cryo-EM) is a powerful method for determining the structures of macromolecules at near-atomic resolution. Recently, larger 2D images with finer resolution have been collected, which can improve reconstruction resolution. However, large-size data incurs high computation and memory overheads, and current implementations fail to perform the complete reconstruction workflow on a multi-GPU cluster for such data: without an effective parallel method for 3D convolution, and given the huge memory demand, large-size data cannot be reconstructed efficiently, which impedes resolution improvements in 3D reconstruction. To enable cryo-EM 3D reconstruction with large-size data on multiple GPUs, we propose a new parallel framework called OML-Relion. In OML-Relion, we first adopt a stride-based Fourier transform and eliminate data dependences to parallelize the 3D convolution across GPUs. Because the input size varies in each iteration, we then use an auto-tuning model to optimize 3D convolution performance. Finally, to guarantee the whole reconstruction on a multi-GPU cluster for large-size data, we design a novel lossless data compression algorithm to further reduce the memory overhead on each GPU. Experiments show that OML-Relion can efficiently handle large-size cryo-EM 3D reconstruction on multiple GPUs. The reconstruction module, including the 3D convolution operation, achieves 225-330× speedups for particles of 200-800 pixels, and the compression algorithm reduces memory overhead by approximately 70%. Moreover, the whole OML-Relion workflow achieves a 54-65× speedup over Relion on two large-size datasets.
Citations: 0