Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases

2010 IEEE International Conference on Cluster Computing. Pub Date: 2010-09-20. DOI: 10.1109/CLUSTER.2010.43

P. Pébay, D. Thompson, Janine Bennett


Citations: 15

Abstract

Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and χ² independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference from moment-based statistics (which we discussed in [1]), where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
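As a rough illustration of the map-reduce pattern the abstract describes, the following Python sketch counts value pairs per data partition, merges the partial tables, and derives joint and marginal probabilities, point-wise mutual information, joint entropy, and the Pearson χ² statistic from the merged table. It is not the authors' implementation: the function names (`local_table`, `merge_tables`, `derived_statistics`), the use of `collections.Counter`, and the natural-log base are all assumptions made for this example.

```python
from collections import Counter
from functools import reduce
from math import log


def local_table(pairs):
    """Map step (hypothetical name): count (x, y) pairs in one partition."""
    return Counter(pairs)


def merge_tables(a, b):
    """Reduce step: add counts cell by cell.

    The merged table has one entry per distinct (x, y) pair, so the data
    exchanged between processes grows with the number of distinct pairs.
    """
    return a + b


def derived_statistics(table):
    """Derive statistics from a merged contingency table (natural log assumed)."""
    n = sum(table.values())
    count_x, count_y = Counter(), Counter()
    for (x, y), c in table.items():
        count_x[x] += c
        count_y[y] += c

    joint = {cell: c / n for cell, c in table.items()}
    marginal_x = {x: c / n for x, c in count_x.items()}
    marginal_y = {y: c / n for y, c in count_y.items()}

    # Point-wise mutual information, defined only for observed cells.
    pmi = {
        (x, y): log(p / (marginal_x[x] * marginal_y[y]))
        for (x, y), p in joint.items()
    }
    # Joint information entropy.
    entropy = -sum(p * log(p) for p in joint.values())

    # Pearson chi-square over every cell, including unobserved ones.
    chi2 = 0.0
    for x, cx in count_x.items():
        for y, cy in count_y.items():
            expected = cx * cy / n
            observed = table.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    return {
        "joint": joint,
        "marginal_x": marginal_x,
        "marginal_y": marginal_y,
        "pmi": pmi,
        "entropy": entropy,
        "chi2": chi2,
    }


if __name__ == "__main__":
    # Two made-up partitions standing in for data held by two processes.
    partitions = [
        [("a", 0), ("a", 0), ("b", 1)],
        [("b", 1), ("a", 1), ("b", 0)],
    ]
    merged = reduce(merge_tables, map(local_table, partitions))
    print(derived_statistics(merged)["chi2"])
```

In this sketch the size of each partial table depends on how many distinct pairs a partition contains. When nearly every observation is a distinct pair, the merged table is about as large as the data itself, which is one way to read the communication cost the abstract attributes to quasi-diffuse input.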

