Evaluation of Combining Bootstrap with Multiple Imputation Using R on Knights Landing Platform

2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud). Pub Date: 2017-06-26. DOI: 10.1109/CSCLOUD.2017.55

Chuan Zhou, Yuxiang Gao, Waylon Howard


Citations: 1

Abstract

Cloud computing and big data technologies are converging to offer a cost-effective delivery model for cloud-based big data analytics. Although the impact of the size and scaling of big data on cloud performance has been extensively studied, the effect of the complexity of the underlying analytic methods has received less attention. This paper develops and evaluates a computationally intensive statistical methodology for performing inference in the presence of both non-Gaussian data and missing data. The methodology combines two well-established statistical approaches, the bootstrap and multiple imputation (MI). The bootstrap is a computer-based nonparametric resampling procedure that randomly resamples the data many thousands of times to construct an empirical distribution, which is then used to construct confidence intervals for significance tests. This technique enables scientists who study data with known non-normality to obtain higher-quality significance tests than are possible with a traditional asymptotic, normal-theory-based significance test. However, the bootstrap procedure only works when no data are missing or when the data are missing completely at random (MCAR). Missing data can lead to biased estimates when the MCAR assumption is violated, and it is unclear how best to implement a bootstrap procedure in the presence of missing data. The proposed methods will provide guidelines and procedures that enable researchers to use the technique in all areas of health, behavioral, and developmental science in which a study has missing data and cannot rely on parametric inference. Bootstrapping and MI can each be computationally expensive, and combining the two adds further computation costs in the cloud. Using carefully constructed simulation examples, we demonstrate that it is feasible to implement the proposed methodology on a high-performance Knights Landing platform. However, the computation costs are substantial even with small data sizes. Further studies are needed on the effects of optimizing the implementation and on its performance with big data.
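The abstract does not say in which order the bootstrap and MI are nested, which imputation model is used, or how the estimates are pooled, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It is written in Python rather than the R used in the paper, it resamples first and then imputes within each resample, and it uses a simple hot-deck draw from the observed values as a placeholder for a real imputation model. The names `impute_once`, `bootstrap_mi_ci`, `n_boot`, and `n_imp` are hypothetical.

```python
import random
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def impute_once(sample, rng):
    """Fill each missing value (None) with a random draw from the observed values.

    Assumption: this hot-deck draw is a placeholder for a proper imputation
    model; it does not correct bias when data are not MCAR.
    """
    observed = [x for x in sample if x is not None]
    return [x if x is not None else rng.choice(observed) for x in sample]


def bootstrap_mi_ci(data, n_boot=500, n_imp=5, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, imputing inside each resample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Bootstrap step: resample rows with replacement, missing values included.
        resample = [rng.choice(data) for _ in data]
        if all(x is None for x in resample):
            continue  # no observed values to impute from
        # MI step: impute the resample n_imp times and average the estimates.
        imputed_means = [
            statistics.fmean(impute_once(resample, rng)) for _ in range(n_imp)
        ]
        estimates.append(statistics.fmean(imputed_means))
    estimates.sort()
    return percentile(estimates, alpha / 2), percentile(estimates, 1 - alpha / 2)


# Skewed (exponential) toy data with 10 of 60 values removed at random.
rng = random.Random(42)
data = [rng.expovariate(1.0) for _ in range(60)]
for i in rng.sample(range(60), 10):
    data[i] = None

low, high = bootstrap_mi_ci(data)
print(f"95% interval for the mean: ({low:.3f}, {high:.3f})")
```

Even in this toy form the statistic is computed `n_boot * n_imp` times (2,500 here), which illustrates why the abstract reports substantial computation costs at small data sizes once thousands of resamples and a real imputation model are involved.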
