On the Analytic Power of Divide & Recombine (D&R)

Proceedings of the 3rd International Conference on Statistics: Theory and Applications. Pub Date: 2021-08-01. DOI: 10.11159/icsta21.003

W. Cleveland



Abstract

In D&R (also known as Split & Conquer), the data are divided into subsets. The division serves as a base for the analysis of big data and for data visualization. Analytic processes are applied to the subsets, and their outputs constitute a recombination of the information in the data. For big data there are three scenarios. (1) The division is based on the subject matter, e.g., financial data for 100 banks: the division is by bank, and the 100 outputs of the analytic methods are further analyzed. (2) An analytic method is applied to each subset, and a recombination method is applied to the outputs to get one result for all of the data. This can provide, for all of the data, estimates of parameters or more complex information such as a likelihood function. D&R research consists of finding division and recombination methods that maximize statistical accuracy. Parallel distributed environments like Hadoop and Spark provide high computational performance for (1) and (2). (3) As in (2), an analytic method is applied to all subsets, but the optimization, e.g., maximum likelihood, is carried out with an iterative MM algorithm, which among other useful properties can avoid inverting very large matrices and can turn a non-differentiable problem into a smooth one. For visualization, subsets are created by conditioning on one or more variables of the analysis to create subsets of the other variables in the analysis. The subsets are displayed using the Trellis Display framework of multi-panel display. This provides a very powerful mechanism for exploratory study of multi-dimensional datasets, for modeling the data, and for understanding the results of analysis.
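Scenario (2) can be illustrated with a minimal sketch. It is not the method of the paper: it assumes a random division into equal-sized subsets, ordinary least squares as the analytic method, and an unweighted mean of the subset estimates as the recombination method. The function names (divide, fit_subset, recombine) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def divide(X, y, n_subsets, rng):
    """Division (assumed here): randomly partition the rows into subsets."""
    idx = rng.permutation(len(y))
    return [(X[part], y[part]) for part in np.array_split(idx, n_subsets)]


def fit_subset(X, y):
    """Analytic method (assumed here): ordinary least squares on one subset."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def recombine(estimates):
    """Recombination (assumed here): unweighted mean of subset estimates."""
    return np.mean(estimates, axis=0)


# Simulated data standing in for a large dataset.
rng = np.random.default_rng(0)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Divide, apply the analytic method to each subset, then recombine.
subsets = divide(X, y, n_subsets=20, rng=rng)
beta_dr = recombine([fit_subset(Xs, ys) for Xs, ys in subsets])

# Reference: the same analytic method applied to all of the data at once.
beta_all = fit_subset(X, y)

print("D&R estimate:      ", beta_dr)
print("All-data estimate: ", beta_all)
```

In this sketch each subset fit is independent of the others, which is why the per-subset step can be run in parallel. How close the D&R estimate comes to the all-data estimate depends on the choice of division and recombination methods, which is the question the abstract identifies as the subject of D&R research.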
