Numerically stable parallel computation of (co-)variance

Erich Schubert, Michael Gertz
Proceedings of the 30th International Conference on Scientific and Statistical Database Management, published 2018-07-09
DOI: 10.1145/3221269.3223036 (https://doi.org/10.1145/3221269.3223036)
Citations: 28

Abstract

With the advent of big data, we see an increasing interest in computing correlations in huge data sets with both many instances and many variables. Essential descriptive statistics such as the variance, standard deviation, covariance, and correlation can suffer from a numerical instability known as "catastrophic cancellation", which can lead to problems when these statistics are computed naively with a popular textbook equation. Although this instability was already discussed in the literature 50 years ago, we found that even today some high-profile tools still employ the unstable version. In this paper, we study a popular incremental technique originally proposed by Welford, which we extend to weighted covariance and correlation. We also discuss strategies for further improving numerical precision; how to compute such statistics online on a data stream, with exponential aging, and with missing data; and a batch parallelization for both high performance and numerical precision. We demonstrate when the numerical instability arises and how different approaches perform under these conditions. We showcase applications ranging from the classic computation of variance to advanced applications, such as stock market analysis with exponentially weighted moving models and Gaussian mixture modeling for cluster analysis, all of which benefit from this approach.
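
To make the contrast in the abstract concrete, here is a minimal Python sketch of a weighted, Welford-style running variance next to the textbook formula. It is an illustration only, not the authors' code: the names `RunningStats`, `add`, `merge`, `variance`, and `textbook_variance` are hypothetical, and the sketch assumes positive weights and the population (not sample) variance. The `merge` step shows one common way to combine partial results from separate data partitions, which is the kind of operation a batch-parallel computation relies on.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Weighted running mean and sum of squared deviations (Welford-style)."""
    weight: float = 0.0  # total weight seen so far
    mean: float = 0.0    # running weighted mean
    m2: float = 0.0      # weighted sum of squared deviations from the mean

    def add(self, x: float, w: float = 1.0) -> None:
        """Fold one observation x with weight w > 0 into the state."""
        self.weight += w
        delta = x - self.mean
        self.mean += delta * (w / self.weight)
        # Uses the old deviation (delta) and the new one (x - updated mean).
        self.m2 += w * delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two partial states, e.g. from two data partitions."""
        total = self.weight + other.weight
        if total == 0.0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * (other.weight / total)
        m2 = (self.m2 + other.m2
              + delta * delta * (self.weight * other.weight / total))
        return RunningStats(total, mean, m2)

    def variance(self) -> float:
        """Population variance: divide by the total weight."""
        return self.m2 / self.weight


def textbook_variance(xs):
    """Naive E[X^2] - E[X]^2, prone to catastrophic cancellation."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2


# Small spread around a large offset; the exact population variance is 22.5.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]

stats = RunningStats()
for x in data:
    stats.add(x)

# Process two halves independently, then merge the partial states.
left, right = RunningStats(), RunningStats()
for x in data[:2]:
    left.add(x)
for x in data[2:]:
    right.add(x)
merged = left.merge(right)

print("incremental:", stats.variance())
print("merged:     ", merged.variance())
print("textbook:   ", textbook_variance(data))
```

With a large offset and a small spread, the squared terms in the textbook formula are around 1e18, where adjacent double-precision values are more than a hundred apart, so the subtraction cannot recover a variance of 22.5. The incremental and merged versions only ever work with deviations from the running mean and stay close to the exact value.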