Scaling and Parallelization of Big Data Analysis on HPC and Cloud Systems

Mike Mikailov, N. Petrick, Yasameen Azarbaijani, Fu-Jyh Luo, Lohit Valleru, Stephen Whitney, Yelizaveta Torosyan

2019 International Conference on Advances in Computing and Communication Engineering (ICACCE), April 2019. DOI: 10.1109/ICACCE46606.2019.9079987
Abstract
Big data analysis can exhibit significant scaling problems when migrated to High Performance Computing (HPC) clusters and/or cloud computing platforms if traditional software parallelization techniques such as POSIX multi-threading and Message Passing Interface (MPI) are used. This paper introduces a novel scaling technique based on the well-known array job mechanism to enable a team of FDA researchers to validate a method for identifying evidence of possible adverse events in very large sets of patient medical records. The analysis employed the widely used basic Statistical Analysis Software (SAS) package; the proposed parallelization approach dramatically increased the scaling, and thus the speed of job completion, for this application, and it is applicable to any similar software written in any other programming language. The new scaling technique offers an O(T) theoretical speedup in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. The basis of the new approach is the segmentation of both (a) the big data set being analyzed and (b) the large number of SAS analysis types applied to each data segment. Each of the resulting unique pairs of data-set segment and analysis-type segment is then processed by a separate computing node (core) in a pseudo-parallel manner. As a result, a SAS big data analysis that required more than 10 days to complete and consumed more than a terabyte of RAM on a single multi-core computing node finished in less than an hour when parallelized over a large number of nodes, none of which needed more than 50 GB of RAM. The massive increase in the number of tasks when running an analysis job with this degree of segmentation reduces the size, scope, and execution time of each task. Besides the significant speed improvement, additional benefits include fine-grained checkpointing and increased flexibility of job submission.
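The mechanism the abstract describes rests on decoding a flat array-job task index into one unique (data segment, analysis type) pair, so that each of the T tasks runs a small, independent SAS job on its own node or core. The sketch below illustrates such a decoding; it is not taken from the paper, and the SLURM_ARRAY_TASK_ID environment variable, the segment count, the analysis program names, and the sas -sysparm invocation are all illustrative assumptions for a hypothetical Slurm plus SAS setup.

    # Minimal sketch (not from the paper): map an array-job task index to one
    # (data segment, analysis type) pair, giving T = segments x analysis types
    # independent tasks. Names and flags below are assumptions, not the paper's.
    import os
    import subprocess

    NUM_DATA_SEGMENTS = 100                          # hypothetical number of data-set segments
    ANALYSIS_TYPES = ["freq", "logistic", "means"]   # hypothetical SAS analysis programs

    def decode_task(task_id: int) -> tuple[int, str]:
        """Decode a flat array-job index into a (data segment, analysis type) pair."""
        segment = task_id % NUM_DATA_SEGMENTS
        analysis = ANALYSIS_TYPES[task_id // NUM_DATA_SEGMENTS]
        return segment, analysis

    def main() -> None:
        task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # set by the scheduler for each task
        segment, analysis = decode_task(task_id)
        # Each task runs one small, independent SAS job on its assigned segment.
        subprocess.run(
            ["sas", f"{analysis}.sas", "-sysparm", f"segment={segment}"],
            check=True,
        )

    if __name__ == "__main__":
        main()

With a decoding like this, the whole analysis is submitted as a single array-job request, for example sbatch --array=0-299 run_task.sh under Slurm or qsub -t 1-300 under Grid Engine (scheduler and ranges are illustrative), which is how the large number of segment/analysis pairs is farmed out to separate nodes.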