{"title":"Computer-Intensive Statistics: A Promising Interplay between Statistics and Computer Science","authors":"S. Sapra","doi":"10.31021/acs.20181113","DOIUrl":null,"url":null,"abstract":"Editorial Statistics and computer science have grown as separate disciplines with little interaction for the past several decades. This however, has changed radically in recent years with the availability of massive and complex datasets in medicine, social media, and physical sciences. The statistical techniques developed for regular datasets simply cannot be scaled to meet the challenges of big data, notably the computational and statistical curses of dimensionality. The dire need to meet the challenges of big data has led to the development of statistical learning, machine learning and deep learning techniques. Rapid improvements in the speed and lower costs of statistical computation in recent years have freed statistical theory from its two serious limitations: the widespread assumption that the data follow the bell-shaped curve and exclusive focus on measures, such as mean, standard deviation, and correlation whose properties could be analyzed mathematically [1]. Computer-intensive statistical techniques have freed practical applications from the constraints of mathematical tractability and today can deal with most problems without the restrictive assumption of Gaussian distribution. These methods can be classified into frequentist and Bayesian methods. The former methods utilize the sample information only while the latter methods utilize both the sample and prior information. Frequentist statistical methods have benefitted enormously from the interaction of statistics with computer science. A very popular computer-intensive method is the bootstrap for estimating the statistical accuracy of a measure, such as correlation in a single sample. The procedure involves generating a very large number of samples with replacement from the original sample. Bootstrap as a measure of statistical accuracy has been shown to be extremely reliable in theoretical research [2,3]. Another widely used computer-intensive method for measuring the accuracy of statistical methods is cross validation. It works non-parametrically without the need for probabilistic modelling and measures the mean-squared-error for the test sample using the training sample to evaluate the performance of various machine learning methods for selecting the best method. Other frequentist statistical methods that rely on a powerful computing environment include jackknife for estimating bias and variance of an estimator, classification and regression trees for prediction, generalized linear models for parametric modelling with continuous, discrete or count response [4], generalized additive models for flexible semi-parametric regression modeling [5], the LASSO method for Cox proportional hazard regression in high dimensional settings [6], and EM algorithm [7] for finding iteratively the maximum likelihood or maximum a posteriori (MAP) estimates of parameters in complex statistical models with latent variables, alternating between performing an expectation (E) step, which evaluates the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. 
Bagging, random forests, and boosting [8,9] are some relatively recent developments in machine learning which use large amounts of data to fit a very rich class of functions to the data almost automatically. These methods represent a fitted model as a sum of regression trees. A regression tree by itself is a fairly weak prediction model, so these methods greatly improve prediction performance by constructing ensembles of either deep trees under random forests or shallow trees under boosting. Support vector machine (SVM) [10], an approach for linear and nonlinear classification developed in computer science, has been found to perform very well and is widely used by statisticians and data scientists. Neural networks are a class of learning methods developed separately in statistics and artificial intelligence, which use a computer-based model of human brain to perform complex tasks. These methods have found applications across several disciplines, including medicine, geosciences, hydrology, engineering, business, and economics. Some common statistical models, such as multiple linear regression, logistic regression, and linear discriminant analysis for classifying binary response are akin to neural networks. The main idea underlying these methods is to extract linear combinations of inputs as derived features and model the output as a nonlinear function of these features called the activation function. Bayesian statistical methods have also benefited greatly from computer-intensive methods, notably the Markov Chain Monte Carlo (MCMC) approach [11], which is a class Article Information","PeriodicalId":115827,"journal":{"name":"Advances in Computer Sciences","volume":"2009 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Computer Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31021/acs.20181113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Statistics and computer science have grown as separate disciplines with little interaction for the past several decades. This, however, has changed radically in recent years with the availability of massive and complex datasets in medicine, social media, and the physical sciences. The statistical techniques developed for regular datasets simply cannot be scaled to meet the challenges of big data, notably the computational and statistical curses of dimensionality. The pressing need to meet these challenges has led to the development of statistical learning, machine learning, and deep learning techniques. Rapid improvements in the speed, and reductions in the cost, of statistical computation have freed statistical theory from two serious limitations: the widespread assumption that data follow the bell-shaped Gaussian curve, and an exclusive focus on measures, such as the mean, standard deviation, and correlation, whose properties can be analyzed mathematically [1]. Computer-intensive statistical techniques have freed practical applications from the constraints of mathematical tractability and today can deal with most problems without the restrictive Gaussian assumption. These techniques can be classified into frequentist and Bayesian methods: the former use only the sample information, while the latter use both sample and prior information.

Frequentist statistical methods have benefited enormously from the interaction of statistics with computer science. A very popular computer-intensive method is the bootstrap, which estimates the statistical accuracy of a measure, such as a correlation, from a single sample. The procedure generates a very large number of resamples drawn with replacement from the original sample and recomputes the measure on each of them. Theoretical research has shown the bootstrap to be an extremely reliable measure of statistical accuracy [2,3]. Another widely used computer-intensive method for assessing the accuracy of statistical procedures is cross-validation. It works non-parametrically, without the need for probabilistic modelling: a model is fit on a training sample and its mean squared error is measured on a held-out test sample, so that the performance of competing machine learning methods can be compared and the best one selected.

Other frequentist methods that rely on a powerful computing environment include the jackknife for estimating the bias and variance of an estimator, classification and regression trees for prediction, generalized linear models for parametric modelling of continuous, discrete, or count responses [4], generalized additive models for flexible semi-parametric regression modelling [5], the LASSO for Cox proportional hazards regression in high-dimensional settings [6], and the EM algorithm [7] for iteratively finding maximum likelihood or maximum a posteriori (MAP) estimates of the parameters in complex statistical models with latent variables. The EM algorithm alternates between an expectation (E) step, which evaluates the expected log-likelihood using the current parameter estimates, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step.
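As a concrete illustration of the E and M steps just described, the following minimal sketch fits a two-component Gaussian mixture by EM using NumPy and SciPy. It is not taken from the editorial; the simulated data, starting values, and fixed iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Simulated data from two Gaussians; the component labels are latent (unobserved).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

# Initial guesses for the mixing weight, means, and standard deviations.
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(200):
    # E step: responsibility (posterior probability) that each point
    # belongs to component 1, given the current parameter estimates.
    p0 = (1 - pi) * norm.pdf(x, mu[0], sigma[0])
    p1 = pi * norm.pdf(x, mu[1], sigma[1])
    gamma = p1 / (p0 + p1)

    # M step: update parameters to maximize the expected log-likelihood.
    pi = gamma.mean()
    mu = np.array([np.average(x, weights=1 - gamma),
                   np.average(x, weights=gamma)])
    sigma = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=1 - gamma),
                              np.average((x - mu[1]) ** 2, weights=gamma)]))

print(pi, mu, sigma)  # estimates should be close to the values used in simulation
```

In practice one would monitor the log-likelihood and stop when it no longer increases appreciably, rather than running a fixed number of iterations.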
Bagging, random forests, and boosting [8,9] are relatively recent developments in machine learning that use large amounts of data to fit a very rich class of functions almost automatically. These methods represent the fitted model as a sum of regression trees. A regression tree by itself is a fairly weak prediction model, so these methods greatly improve prediction performance by constructing ensembles of trees: deep trees in random forests and shallow trees in boosting. The support vector machine (SVM) [10], an approach to linear and nonlinear classification developed in computer science, has been found to perform very well and is widely used by statisticians and data scientists.

Neural networks are a class of learning methods, developed separately in statistics and artificial intelligence, that use a computer-based model of the human brain to perform complex tasks. They have found applications across several disciplines, including medicine, geosciences, hydrology, engineering, business, and economics. Some common statistical models, such as multiple linear regression, logistic regression, and linear discriminant analysis for classifying a binary response, are akin to neural networks. The main idea underlying these methods is to extract linear combinations of the inputs as derived features and to model the output as a nonlinear function of these features, known as the activation function.

Bayesian statistical methods have also benefited greatly from computer-intensive methods, notably the Markov chain Monte Carlo (MCMC) approach [11], which is a class of algorithms that draw samples from a posterior distribution by constructing a Markov chain whose stationary distribution is that posterior.
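To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler for the posterior mean of normal data with known unit variance and a standard normal prior. It is a sketch under assumed data and tuning choices (proposal scale, chain length, burn-in), not a method prescribed by the editorial.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(1.5, 1.0, 50)           # observed sample, known unit variance

def log_posterior(theta):
    # log N(0, 1) prior plus log N(theta, 1) likelihood, up to additive constants
    return -0.5 * theta ** 2 - 0.5 * np.sum((data - theta) ** 2)

theta, chain = 0.0, []
for _ in range(10_000):
    proposal = theta + rng.normal(0.0, 0.5)       # symmetric random-walk proposal
    log_accept = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_accept:        # Metropolis acceptance rule
        theta = proposal
    chain.append(theta)

posterior_draws = np.array(chain[2_000:])         # discard burn-in draws
print(posterior_draws.mean(), posterior_draws.std())
```

Because the stationary distribution of this chain is the target posterior, the retained draws approximate posterior summaries such as the mean and standard deviation without any Gaussian or conjugacy assumptions beyond those built into the model itself.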