{"title":"Computer-Intensive Statistics: A Promising Interplay between Statistics and Computer Science","authors":"S. Sapra","doi":"10.31021/acs.20181113","DOIUrl":null,"url":null,"abstract":"Editorial Statistics and computer science have grown as separate disciplines with little interaction for the past several decades. This however, has changed radically in recent years with the availability of massive and complex datasets in medicine, social media, and physical sciences. The statistical techniques developed for regular datasets simply cannot be scaled to meet the challenges of big data, notably the computational and statistical curses of dimensionality. The dire need to meet the challenges of big data has led to the development of statistical learning, machine learning and deep learning techniques. Rapid improvements in the speed and lower costs of statistical computation in recent years have freed statistical theory from its two serious limitations: the widespread assumption that the data follow the bell-shaped curve and exclusive focus on measures, such as mean, standard deviation, and correlation whose properties could be analyzed mathematically [1]. Computer-intensive statistical techniques have freed practical applications from the constraints of mathematical tractability and today can deal with most problems without the restrictive assumption of Gaussian distribution. These methods can be classified into frequentist and Bayesian methods. The former methods utilize the sample information only while the latter methods utilize both the sample and prior information. Frequentist statistical methods have benefitted enormously from the interaction of statistics with computer science. A very popular computer-intensive method is the bootstrap for estimating the statistical accuracy of a measure, such as correlation in a single sample. The procedure involves generating a very large number of samples with replacement from the original sample. Bootstrap as a measure of statistical accuracy has been shown to be extremely reliable in theoretical research [2,3]. Another widely used computer-intensive method for measuring the accuracy of statistical methods is cross validation. It works non-parametrically without the need for probabilistic modelling and measures the mean-squared-error for the test sample using the training sample to evaluate the performance of various machine learning methods for selecting the best method. Other frequentist statistical methods that rely on a powerful computing environment include jackknife for estimating bias and variance of an estimator, classification and regression trees for prediction, generalized linear models for parametric modelling with continuous, discrete or count response [4], generalized additive models for flexible semi-parametric regression modeling [5], the LASSO method for Cox proportional hazard regression in high dimensional settings [6], and EM algorithm [7] for finding iteratively the maximum likelihood or maximum a posteriori (MAP) estimates of parameters in complex statistical models with latent variables, alternating between performing an expectation (E) step, which evaluates the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. 
Bagging, random forests, and boosting [8,9] are some relatively recent developments in machine learning which use large amounts of data to fit a very rich class of functions to the data almost automatically. These methods represent a fitted model as a sum of regression trees. A regression tree by itself is a fairly weak prediction model, so these methods greatly improve prediction performance by constructing ensembles of either deep trees under random forests or shallow trees under boosting. Support vector machine (SVM) [10], an approach for linear and nonlinear classification developed in computer science, has been found to perform very well and is widely used by statisticians and data scientists. Neural networks are a class of learning methods developed separately in statistics and artificial intelligence, which use a computer-based model of human brain to perform complex tasks. These methods have found applications across several disciplines, including medicine, geosciences, hydrology, engineering, business, and economics. Some common statistical models, such as multiple linear regression, logistic regression, and linear discriminant analysis for classifying binary response are akin to neural networks. The main idea underlying these methods is to extract linear combinations of inputs as derived features and model the output as a nonlinear function of these features called the activation function. Bayesian statistical methods have also benefited greatly from computer-intensive methods, notably the Markov Chain Monte Carlo (MCMC) approach [11], which is a class Article Information","PeriodicalId":115827,"journal":{"name":"Advances in Computer Sciences","volume":"2009 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Computer Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31021/acs.20181113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Statistics and computer science have grown as separate disciplines with little interaction for the past several decades. This, however, has changed radically in recent years with the availability of massive and complex datasets in medicine, social media, and the physical sciences. The statistical techniques developed for regular datasets simply cannot be scaled to meet the challenges of big data, notably the computational and statistical curses of dimensionality. The pressing need to meet these challenges has led to the development of statistical learning, machine learning, and deep learning techniques. Rapid improvements in the speed, and reductions in the cost, of statistical computation have freed statistical theory from two serious limitations: the widespread assumption that data follow the bell-shaped Gaussian curve, and an exclusive focus on measures, such as the mean, standard deviation, and correlation, whose properties can be analyzed mathematically [1]. Computer-intensive statistical techniques have freed practical applications from the constraints of mathematical tractability and today can deal with most problems without the restrictive Gaussian assumption. These techniques can be classified into frequentist and Bayesian methods: the former use only the sample information, while the latter use both sample and prior information.

Frequentist statistical methods have benefited enormously from the interaction of statistics with computer science. A very popular computer-intensive method is the bootstrap, which estimates the statistical accuracy of a measure, such as a correlation, from a single sample. The procedure generates a very large number of resamples drawn with replacement from the original sample and recomputes the measure on each of them. Theoretical research has shown the bootstrap to be an extremely reliable measure of statistical accuracy [2,3]. Another widely used computer-intensive method for assessing the accuracy of statistical procedures is cross-validation. It works non-parametrically, without the need for probabilistic modelling: a model is fit on a training sample and its mean squared error is measured on a held-out test sample, so that the performance of competing machine learning methods can be compared and the best one selected.

Other frequentist methods that rely on a powerful computing environment include the jackknife for estimating the bias and variance of an estimator, classification and regression trees for prediction, generalized linear models for parametric modelling of continuous, discrete, or count responses [4], generalized additive models for flexible semi-parametric regression modelling [5], the LASSO for Cox proportional hazards regression in high-dimensional settings [6], and the EM algorithm [7] for iteratively finding maximum likelihood or maximum a posteriori (MAP) estimates of the parameters in complex statistical models with latent variables. The EM algorithm alternates between an expectation (E) step, which evaluates the expected log-likelihood using the current parameter estimates, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step.
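As a concrete illustration of the E and M steps just described, the following minimal sketch fits a two-component Gaussian mixture by EM using NumPy and SciPy. It is not taken from the editorial; the simulated data, starting values, and fixed iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Simulated data from two Gaussians; the component labels are latent (unobserved).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

# Initial guesses for the mixing weight, means, and standard deviations.
pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(200):
    # E step: responsibility (posterior probability) that each point
    # belongs to component 1, given the current parameter estimates.
    p0 = (1 - pi) * norm.pdf(x, mu[0], sigma[0])
    p1 = pi * norm.pdf(x, mu[1], sigma[1])
    gamma = p1 / (p0 + p1)

    # M step: update parameters to maximize the expected log-likelihood.
    pi = gamma.mean()
    mu = np.array([np.average(x, weights=1 - gamma),
                   np.average(x, weights=gamma)])
    sigma = np.sqrt(np.array([np.average((x - mu[0]) ** 2, weights=1 - gamma),
                              np.average((x - mu[1]) ** 2, weights=gamma)]))

print(pi, mu, sigma)  # estimates should be close to the values used in simulation
```

In practice one would monitor the log-likelihood and stop when it no longer increases appreciably, rather than running a fixed number of iterations.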
Bagging, random forests, and boosting [8,9] are relatively recent developments in machine learning that use large amounts of data to fit a very rich class of functions almost automatically. These methods represent the fitted model as a sum of regression trees. A regression tree by itself is a fairly weak prediction model, so these methods greatly improve prediction performance by constructing ensembles of trees: deep trees in random forests and shallow trees in boosting. The support vector machine (SVM) [10], an approach to linear and nonlinear classification developed in computer science, has been found to perform very well and is widely used by statisticians and data scientists.

Neural networks are a class of learning methods, developed separately in statistics and artificial intelligence, that use a computer-based model of the human brain to perform complex tasks. They have found applications across several disciplines, including medicine, geosciences, hydrology, engineering, business, and economics. Some common statistical models, such as multiple linear regression, logistic regression, and linear discriminant analysis for classifying a binary response, are akin to neural networks. The main idea underlying these methods is to extract linear combinations of the inputs as derived features and to model the output as a nonlinear function of these features, known as the activation function.

Bayesian statistical methods have also benefited greatly from computer-intensive methods, notably the Markov chain Monte Carlo (MCMC) approach [11], which is a class of algorithms that draw samples from a posterior distribution by constructing a Markov chain whose stationary distribution is that posterior.
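To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler for the posterior mean of normal data with known unit variance and a standard normal prior. It is a sketch under assumed data and tuning choices (proposal scale, chain length, burn-in), not a method prescribed by the editorial.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(1.5, 1.0, 50)           # observed sample, known unit variance

def log_posterior(theta):
    # log N(0, 1) prior plus log N(theta, 1) likelihood, up to additive constants
    return -0.5 * theta ** 2 - 0.5 * np.sum((data - theta) ** 2)

theta, chain = 0.0, []
for _ in range(10_000):
    proposal = theta + rng.normal(0.0, 0.5)       # symmetric random-walk proposal
    log_accept = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_accept:        # Metropolis acceptance rule
        theta = proposal
    chain.append(theta)

posterior_draws = np.array(chain[2_000:])         # discard burn-in draws
print(posterior_draws.mean(), posterior_draws.std())
```

Because the stationary distribution of this chain is the target posterior, the retained draws approximate posterior summaries such as the mean and standard deviation without any Gaussian or conjugacy assumptions beyond those built into the model itself.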