Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases

P. Pébay, D. Thompson, Janine Bennett
{"title":"Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases","authors":"P. Pébay, D. Thompson, Janine Bennett","doi":"10.1109/CLUSTER.2010.43","DOIUrl":null,"url":null,"abstract":"Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and c2 independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel.We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2010.43","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and c2 independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel.We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
并行计算偶然性统计:设计权衡和限制情况
统计分析通常用于降低数据的维数并从数据中推断意义。任何针对大规模分布式数据的统计分析包的一个关键挑战是解决并行可伸缩性和数值稳定性的正交问题。许多统计技术,例如描述性统计或主成分分析,都是基于矩和共矩的,并且使用鲁棒的在线更新公式,可以以令人尴尬的并行方式计算,适合于map-reduce风格的实现。本文主要研究列联表,通过列联表可以直接得到联合概率和边际概率、点向互信息、信息熵和c2独立统计量等许多派生统计量。然而,随著数据大小的增加,列联表可能会变大,这相应地要求处理器之间进行大量的通信。这种潜在的通信增加阻碍了最佳的并行加速,这是与基于矩的统计(我们在[1]中讨论过)的主要区别,在基于矩的统计中,处理器间通信的数量与数据大小无关。在这里,我们提出了我们为实现列联表并行计算所做的设计权衡。我们还研究了开源实现的并行加速和可伸缩性特性。特别是,当在适当的上下文中使用偶然性统计时,即当数据输入不是准扩散时,我们观察到最佳的加速和可伸缩性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信