Scaling and Parallelization of Big Data Analysis on HPC and Cloud Systems

Mike Mikailov, N. Petrick, Yasameen Azarbaijani, Fu-Jyh Luo, Lohit Valleru, Stephen Whitney, Yelizaveta Torosyan

2019 International Conference on Advances in Computing and Communication Engineering (ICACCE), April 2019. DOI: 10.1109/ICACCE46606.2019.9079987
Abstract
Big data analysis can exhibit significant scaling problems when migrated to High Performance Computing (HPC) clusters and/or cloud computing platforms if traditional software parallelization techniques such as POSIX multi-threading and Message Passing Interface (MPI) are used. This paper introduces a novel scaling technique based on the well-known array job mechanism to enable a team of FDA researchers to validate a method for identifying evidence of possible adverse events in very large sets of patient medical records. The analysis employed the widely used basic Statistical Analysis Software (SAS) package; the proposed parallelization approach dramatically increased the scaling, and thus the speed of job completion, for this application, and it is applicable to any similar software written in any other programming language. The new scaling technique offers an O(T) theoretical speedup in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. The basis of the new approach is the segmentation of both (a) the big data set being analyzed and (b) the large number of SAS analysis types applied to each data segment. Each of the resulting unique pairs of data-set segment and analysis-type segment is then processed by a separate computing node (core) in a pseudo-parallel manner. As a result, a SAS big data analysis that required more than 10 days to complete and consumed more than a terabyte of RAM on a single multi-core computing node finished in less than an hour when parallelized over a large number of nodes, none of which needed more than 50 GB of RAM. The massive increase in the number of tasks when running an analysis job with this degree of segmentation reduces the size, scope, and execution time of each task. Besides the significant speed improvement, additional benefits include fine-grained checkpointing and increased flexibility of job submission.
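The mechanism the abstract describes rests on decoding a flat array-job task index into one unique (data segment, analysis type) pair, so that each of the T tasks runs a small, independent SAS job on its own node or core. The sketch below illustrates such a decoding; it is not taken from the paper, and the SLURM_ARRAY_TASK_ID environment variable, the segment count, the analysis program names, and the sas -sysparm invocation are all illustrative assumptions for a hypothetical Slurm plus SAS setup.

    # Minimal sketch (not from the paper): map an array-job task index to one
    # (data segment, analysis type) pair, giving T = segments x analysis types
    # independent tasks. Names and flags below are assumptions, not the paper's.
    import os
    import subprocess

    NUM_DATA_SEGMENTS = 100                          # hypothetical number of data-set segments
    ANALYSIS_TYPES = ["freq", "logistic", "means"]   # hypothetical SAS analysis programs

    def decode_task(task_id: int) -> tuple[int, str]:
        """Decode a flat array-job index into a (data segment, analysis type) pair."""
        segment = task_id % NUM_DATA_SEGMENTS
        analysis = ANALYSIS_TYPES[task_id // NUM_DATA_SEGMENTS]
        return segment, analysis

    def main() -> None:
        task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by the scheduler for each task
        segment, analysis = decode_task(task_id)
        # Each task runs one small, independent SAS job on its assigned segment.
        subprocess.run(
            ["sas", f"{analysis}.sas", "-sysparm", f"segment={segment}"],
            check=True,
        )

    if __name__ == "__main__":
        main()

With a decoding like this, the whole analysis is submitted as a single array-job request, for example sbatch --array=0-299 run_task.sh under Slurm or qsub -t 1-300 under Grid Engine (scheduler and ranges are illustrative), which is how the large number of segment/analysis pairs is farmed out to separate nodes.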