Statistical Analysis and Data Mining: The ASA Data Science Journal: Latest Articles

Sketched Stochastic Dictionary Learning for large‐scale data and application to high‐throughput mass spectrometry
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-08-20 DOI: 10.1002/sam.11542
O. Permiakova, T. Burger
{"title":"Sketched Stochastic Dictionary Learning for large‐scale data and application to high‐throughput mass spectrometry","authors":"O. Permiakova, T. Burger","doi":"10.1002/sam.11542","DOIUrl":"https://doi.org/10.1002/sam.11542","url":null,"abstract":"Factorization of large data corpora has emerged as an essential technique to extract dictionaries (sets of patterns that are meaningful for sparse encoding). Following this line, we present a novel algorithm based on compressive learning theory. In this framework, the (arbitrarily large) dataset of interest is replaced by a fixed‐size sketch resulting from a random sampling of the data distribution characteristic function. We apply our algorithm to the extraction of chromatographic elution profiles in mass spectrometry data, where it demonstrates its efficiency and interest compared to other related algorithms.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"101 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114095267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Weighted validation of heteroscedastic regression models for better selection
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-08-17 DOI: 10.1002/sam.11544
Yoonsuh Jung, Hayoung Kim
{"title":"Weighted validation of heteroscedastic regression models for better selection","authors":"Yoonsuh Jung, Hayoung Kim","doi":"10.1002/sam.11544","DOIUrl":"https://doi.org/10.1002/sam.11544","url":null,"abstract":"In this paper, we suggest a method for improving model selection in the presence of heteroscedasticity. For this purpose, we measure the heteroscedasticity in the data using the inter‐quartile range (IQR) of the fitted values under the framework of cross‐validation. To find the IQR, we fit 0.25 and 0.75 generic quantile regression using the training data. The two models then predict the values of the response variable at 0.25 and 0.75 quantiles in the test data, which yields predicted IQR. To reduce the effect of heteroscedastic data in the model selection, we propose to use weighted prediction error. The inverse of the predicted IQR is utilized to estimate the weights. The proposed method reduces the impact of large prediction errors via weighted prediction and leads to better model and parameter selection. The benefits of the proposed method are demonstrated in simulations and with two real data sets.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125180706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Modal linear regression models with multiplicative distortion measurement errors
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-08-10 DOI: 10.1002/sam.11541
Jun Zhang, Gaorong Li, Yiping Yang
{"title":"Modal linear regression models with multiplicative distortion measurement errors","authors":"Jun Zhang, Gaorong Li, Yiping Yang","doi":"10.1002/sam.11541","DOIUrl":"https://doi.org/10.1002/sam.11541","url":null,"abstract":"We consider modal linear regression models when neither the response variable nor the covariates can be directly observed, but are measured with multiplicative distortion measurement errors. Four calibration procedures are used to estimate parameters in the modal linear regression models, namely, conditional mean calibration, conditional absolute mean calibration, conditional variance calibration, and conditional absolute logarithmic calibration. The asymptotic properties for the estimators based on four calibration procedures are established. Monte Carlo simulation experiments are conducted to examine the performance of the proposed estimators. The proposed estimators are applied to analyze a forest fires dataset for an illustration.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114096519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Multivariate Gaussian RBF‐net for smooth function estimation and variable selection
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-08-03 DOI: 10.1002/sam.11540
Arkaprava Roy
{"title":"Multivariate Gaussian RBF‐net for smooth function estimation and variable selection","authors":"Arkaprava Roy","doi":"10.1002/sam.11540","DOIUrl":"https://doi.org/10.1002/sam.11540","url":null,"abstract":"Neural networks are routinely used for nonparametric regression modeling. The interest in these models is growing with ever‐increasing complexities in modern datasets. With modern technological advancements, the number of predictors frequently exceeds the sample size in many application areas. Thus, selecting important predictors from the huge pool is an extremely important task for judicious inference. This paper proposes a novel flexible class of single‐layer radial basis functions (RBF) networks. The proposed architecture can estimate smooth unknown regression functions and also perform variable selection. We primarily focus on Gaussian RBF‐net due to its attractive properties. The extensions to other choices of RBF are fairly straightforward. The proposed architecture is also shown to be effective in identifying relevant predictors in a low‐dimensional setting using the posterior samples without imposing any sparse estimation scheme. We develop an efficient Markov chain Monte Carlo algorithm to generate posterior samples of the parameters. We illustrate the proposed method's empirical efficacy through simulation experiments, both in high and low dimensional regression problems. The posterior contraction rate is established with respect to empirical ℓ2 distance assuming that the error variance is unknown, and the true function belongs to a Hölder ball. We illustrate our method in a Human Connectome Project dataset to predict vocabulary comprehension and to identify important edges of the structural connectome.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133067313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Negative binomial graphical model with excess zeros
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-07-21 DOI: 10.1002/sam.11536
Beomjin Park, Hosik Choi, Changyi Park
{"title":"Negative binomial graphical model with excess zeros","authors":"Beomjin Park, Hosik Choi, Changyi Park","doi":"10.1002/sam.11536","DOIUrl":"https://doi.org/10.1002/sam.11536","url":null,"abstract":"Markov random field or undirected graphical models (GM) are a popular class of GM useful in various fields because they provide an intuitive and interpretable graph expressing the complex relationship between random variables. The zero‐inflated local Poisson graphical model has been proposed as a graphical model for count data with excess zeros. However, as count data are often characterized by over‐dispersion, the local Poisson graphical model may suffer from a poor fit to data. In this paper, we propose a zero‐inflated local negative binomial (NB) graphical model. Due to the dependencies of parameters in our models, a direct optimization of the objective function is difficult. Instead, we devise expectation‐minimization algorithms based on two different parametrizations for the NB distribution. Through a simulation study, we illustrate the effectiveness of our method for learning network structure from over‐dispersed count data with excess zeros. We further apply our method to real data to estimate its network structure.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124654145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Evaluation and interpretation of driving risks: Automobile claim frequency modeling with telematics data
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-07-20 DOI: 10.2139/ssrn.3910216
Yaqian Gao, Yifan Huang, Shengwang Meng
{"title":"Evaluation and interpretation of driving risks: Automobile claim frequency modeling with telematics data","authors":"Yaqian Gao, Yifan Huang, Shengwang Meng","doi":"10.2139/ssrn.3910216","DOIUrl":"https://doi.org/10.2139/ssrn.3910216","url":null,"abstract":"With the development of vehicle telematics and data mining technology, usage‐based insurance (UBI) has aroused widespread interest from both academia and industry. The extensive driving behavior features make it possible to further understand the risks of insured vehicles, but pose challenges in the identification and interpretation of important ratemaking factors. This study, based on the telematics data of policyholders in China's mainland, analyzes insurance claim frequency of commercial trucks using both Poisson regression and several machine learning models, including regression tree, random forest, gradient boosting tree, XGBoost and neural network. After selecting the best model, we analyze feature importance, feature effects and the contribution of each feature to the prediction from an actuarial perspective. Our empirical study shows that XGBoost greatly outperforms the traditional models and detects some important risk factors, such as the average speed, the average mileage traveled per day, the fraction of night driving, the number of sudden brakes and the fraction of left/right turns at intersections. These features usually have a nonlinear effect on driving risk, and there are complex interactions between features. To further distinguish high‐/low‐risk drivers, we run supervised clustering for risk segmentation according to drivers' driving habits. In summary, this study not only provides a more accurate prediction of driving risk, but also greatly satisfies the interpretability requirements of insurance regulators and risk management.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130150275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Power grid frequency prediction using spatiotemporal modeling
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-07-06 DOI: 10.1002/sam.11535
Amanda Lenzi, J. Bessac, M. Anitescu
{"title":"Power grid frequency prediction using spatiotemporal modeling","authors":"Amanda Lenzi, J. Bessac, M. Anitescu","doi":"10.1002/sam.11535","DOIUrl":"https://doi.org/10.1002/sam.11535","url":null,"abstract":"Understanding power system dynamics is essential for interarea oscillation analysis and the detection of grid instabilities. The FNET/GridEye is a GPS‐synchronized wide‐area frequency measurement network that provides an accurate picture of the normal real‐time operational condition of the power system dynamics, giving rise to new and intricate spatiotemporal patterns of power loads. We propose to model FNET/GridEye grid frequency data from the U.S. Eastern Interconnection with a spatiotemporal statistical model. We predict the frequency data at locations without observations, a critical need during disruption events where measurement data are inaccessible. Spatial information is accounted for either as neighboring measurements in the form of covariates or with a spatiotemporal correlation model captured by a latent Gaussian field. The proposed method is useful in estimating power system dynamic response from limited phasor measurements and holds promise for predicting instability that may lead to undesirable effects such as cascading outages.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132414630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Analyzing relevance vector machines using a single penalty approach
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-07-05 DOI: 10.1002/sam.11551
A. Dixit, Vivekananda Roy
{"title":"Analyzing relevance vector machines using a single penalty approach","authors":"A. Dixit, Vivekananda Roy","doi":"10.1002/sam.11551","DOIUrl":"https://doi.org/10.1002/sam.11551","url":null,"abstract":"Relevance vector machine (RVM) is a popular sparse Bayesian learning model typically used for prediction. Recently it has been shown that improper priors assumed on multiple penalty parameters in RVM may lead to an improper posterior. Currently in the literature, the sufficient conditions for posterior propriety of RVM do not allow improper priors over the multiple penalty parameters. In this article, we propose a single penalty relevance vector machine (SPRVM) model in which multiple penalty parameters are replaced by a single penalty and we consider a semi‐Bayesian approach for fitting the SPRVM. The necessary and sufficient conditions for posterior propriety of SPRVM are more liberal than those of RVM and allow for several improper priors over the penalty parameter. Additionally, we also prove the geometric ergodicity of the Gibbs sampler used to analyze the SPRVM model and hence can estimate the asymptotic standard errors associated with the Monte Carlo estimate of the means of the posterior predictive distribution. Such a Monte Carlo standard error cannot be computed in the case of RVM, since the rate of convergence of the Gibbs sampler used to analyze RVM is not known. The predictive performance of RVM and SPRVM is compared by analyzing two simulation examples and three real life datasets.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132586838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Coefficient tree regression for generalized linear models
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-07-02 DOI: 10.1002/sam.11534
Özge Sürer, D. Apley, E. Malthouse
{"title":"Coefficient tree regression for generalized linear models","authors":"Özge Sürer, D. Apley, E. Malthouse","doi":"10.1002/sam.11534","DOIUrl":"https://doi.org/10.1002/sam.11534","url":null,"abstract":"Large regression data sets are now commonplace, with so many predictors that they cannot or should not all be included individually. In practice, derived predictors are relevant as meaningful features or, at the very least, as a form of regularized approximation of the true coefficients. We consider derived predictors that are the sum of some groups of individual predictors, which is equivalent to predictors within a group sharing the same coefficient. However, the groups of predictors are usually not known in advance and must be discovered from the data. In this paper we develop a coefficient tree regression algorithm for generalized linear models to discover the group structure from the data. The approach results in simple and highly interpretable models, and we demonstrated with real examples that it can provide a clear and concise interpretation of the data. Via simulation studies under different scenarios we showed that our approach performs better than existing competitors in terms of computing time and predictive accuracy.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"45 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125830306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Fourier neural networks as function approximators and differential equation solvers
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-22 DOI: 10.1002/sam.11531
M. Ngom, O. Marin
{"title":"Fourier neural networks as function approximators and differential equation solvers","authors":"M. Ngom, O. Marin","doi":"10.1002/sam.11531","DOIUrl":"https://doi.org/10.1002/sam.11531","url":null,"abstract":"We present a Fourier neural network (FNN) that can be mapped directly to the Fourier decomposition. The choice of activation and loss function yields results that replicate a Fourier series expansion closely while preserving a straightforward architecture with a single hidden layer. The simplicity of this network architecture facilitates the integration with any other higher‐complexity networks, at a data pre‐ or postprocessing stage. We validate this FNN on naturally periodic smooth functions and on piecewise continuous periodic functions. We showcase the use of this FNN for modeling or solving partial differential equations with periodic boundary conditions. The main advantages of the current approach are the validity of the solution outside the training region, interpretability of the trained model, and simplicity of use.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117015691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14