Latest articles from Statistical Analysis and Data Mining: The ASA Data Science Journal

CLADAG 2019 Special Issue: Selected Papers on Classification and Data Analysis
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-16 DOI: 10.1002/sam.11533
F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco
Abstract: This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation in Data Analysis and Classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting and schools related to classification and data analysis, publishes a newsletter, and cooperates with other member societies of the IFCS in the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically-based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). Best papers from the conference were submitted to this special issue, and six of them were selected for publication following a blind peer-review process. The manuscripts deal with different data analysis issues: mixture of distributions, compositional data analysis, Markov chain for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jirí Dvorák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier for supervised classification. It combines four complementary classifiers: the depth–depth plot, the bagdistance plot, an approach based on the illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy for the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distributions such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM algorithm is suitable for single-threaded and multi-threaded processors as well as for single-machine and multiple-node systems. The EM algorithm is also discussed in the paper by L. Scrucca, which provides a fast and efficient Modal EM algorithm for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. Motiv…
Citations: 0
Application of the Cox proportional hazards model and competing risks models to critical illness insurance data
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-10 DOI: 10.1002/sam.11532
David Zapletal
Abstract: A commercial insurance company in the Czech Republic provided data on critical illness insurance. Survival analysis was used to study the influence of the gender of an insured person, the age at which the person entered into an insurance contract, and the region where the insured person lived on the occurrence of an insured event. The main goal of the research was to investigate whether the influence of explanatory variables is estimated differently when two different approaches to the analysis are used. The two approaches were (1) the Cox proportional hazards model, which does not assign a specific cause, such as a certain diagnosis, to a critical illness insured event, and (2) competing risks models. Regression models related to these approaches were estimated with R software. The results, which are discussed and compared in the paper, show that insurance companies might benefit from offering policies that consider specific diagnoses as the cause of insured events. They also show that, in addition to age, the gender of the client plays a key role in the occurrence of such insured events.
Citations: 0
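The two approaches contrasted in this abstract can be written down compactly. The notation below is an assumption for illustration: x collects the covariates (gender, entry age, region), k indexes diagnoses, and the cause-specific hazard is shown as one common competing risks formulation. The abstract does not say which competing risks models the paper actually fits.

```latex
% Cox proportional hazards model: hazard of any insured event, regardless of diagnosis
h(t \mid x) = h_0(t)\, \exp\!\left(\beta^{\top} x\right)

% Cause-specific hazard for diagnosis k: each cause gets its own baseline and coefficients
h_k(t \mid x) = h_{0k}(t)\, \exp\!\left(\beta_k^{\top} x\right), \qquad k = 1, \dots, K
```

Under this reading, a covariate effect can differ across diagnoses (different \beta_k), which a single pooled \beta cannot express.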
Cluster analysis via random partition distributions
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-05 DOI: 10.1002/sam.11602
D. B. Dahl, J. Andros, J. Carter
Abstract: Hierarchical and k‐medoids clustering are deterministic clustering algorithms defined on pairwise distances. We use these same pairwise distances in a novel stochastic clustering procedure based on a probability distribution. We call our proposed method CaviarPD, a portmanteau of cluster analysis via random partition distributions. CaviarPD first samples clusterings from a distribution on partitions and then finds the best cluster estimate based on these samples using algorithms to minimize an expected loss. Using eight case studies, we show that our approach produces results as close to the truth as hierarchical and k‐medoids methods, and has the additional advantage of allowing for a probabilistic framework to assess clustering uncertainty. The method provides an intuitive graphical representation of clustering uncertainty through pairwise probabilities from partition samples. A software implementation of the method is available in the CaviarPD package for R.
Citations: 3
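The pairwise probabilities mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the partition samples are already available as lists of cluster labels (the sampling distribution itself is not shown). It then picks the sampled partition closest to the pairwise probability matrix under a squared-error criterion, which is one possible loss and not necessarily the one used by the CaviarPD package. The function names and the toy samples are hypothetical.

```python
from itertools import combinations

def pairwise_probabilities(samples):
    """Fraction of sampled partitions in which items i and j share a cluster."""
    n = len(samples[0])
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = sum(1 for labels in samples if labels[i] == labels[j])
            probs[i][j] = together / len(samples)
    return probs

def least_squares_partition(samples, probs):
    """Return the sampled partition whose co-clustering indicators are closest to probs."""
    def loss(labels):
        return sum(
            ((labels[i] == labels[j]) - probs[i][j]) ** 2
            for i, j in combinations(range(len(labels)), 2)
        )
    return min(samples, key=loss)

# Hypothetical partition samples for four items (labels are arbitrary cluster ids).
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
probs = pairwise_probabilities(samples)
best = least_squares_partition(samples, probs)
```

Entries of `probs` near 0 or 1 indicate pairs whose co-clustering is nearly certain across samples, while values near 0.5 flag uncertain pairs.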
Multi‐node Expectation–Maximization algorithm for finite mixture models
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-05 DOI: 10.1002/sam.11529
Sharon X. Lee, G. McLachlan, Kaleb L. Leemaqz
Abstract: Finite mixture models are powerful tools for modeling and analyzing heterogeneous data. Parameter estimation is typically carried out using maximum likelihood estimation via the Expectation–Maximization (EM) algorithm. Recently, the adoption of flexible distributions as component densities has become increasingly popular. Often, the EM algorithm for these models involves complicated expressions that are time‐consuming to evaluate numerically. In this paper, we describe a parallel implementation of the EM algorithm suitable for both single‐threaded and multi‐threaded processors and for both single‐machine and multiple‐node systems. Numerical experiments are performed to demonstrate the potential performance gain in different settings. Comparison is also made across two commonly used platforms, R and MATLAB. For illustration, a fairly general mixture model is used in the comparison.
Citations: 0
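One way to see why EM parallelizes is that the E-step only needs per-observation quantities, so the data can be split into chunks whose summary statistics are added up before a single M-step. The Python sketch below illustrates that idea under loud assumptions: it swaps in a univariate Gaussian mixture for the flexible distributions the paper targets, uses threads on one machine in place of multiple nodes, and all names (`chunk_statistics`, `parallel_em`) and data are hypothetical. It is not the authors' implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def chunk_statistics(chunk, weights, means, variances):
    """E-step on one data chunk: per-component sums of r, r*x and r*x^2."""
    k = len(weights)
    s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
    for x in chunk:
        dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)
        for j in range(k):
            r = dens[j] / total  # responsibility of component j for x
            s0[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return s0, s1, s2

def parallel_em(data, means, n_chunks=4, n_iter=50):
    k = len(means)
    weights, variances = [1.0 / k] * k, [1.0] * k
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for _ in range(n_iter):
            # E-step: each worker handles one chunk with the current parameters.
            parts = list(pool.map(
                lambda c: chunk_statistics(c, weights, means, variances), chunks))
            # Reduce: add up the chunk statistics, then do the M-step once.
            s0 = [sum(p[0][j] for p in parts) for j in range(k)]
            s1 = [sum(p[1][j] for p in parts) for j in range(k)]
            s2 = [sum(p[2][j] for p in parts) for j in range(k)]
            weights = [s0[j] / len(data) for j in range(k)]
            means = [s1[j] / s0[j] for j in range(k)]
            variances = [max(s2[j] / s0[j] - means[j] ** 2, 1e-6) for j in range(k)]
    return weights, means, variances

# Hypothetical, well-separated toy data and starting means.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances = parallel_em(data, means=[1.0, 8.0])
```

Only the small tuples of sums travel between workers and the coordinator, which is what makes the same pattern plausible across machines; in CPython, threads mainly illustrate the structure and do not speed up this pure-Python arithmetic.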
Modeling and inference for mixtures of simple symmetric exponential families of p‐dimensional distributions for vectors with binary coordinates
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-03 DOI: 10.1002/sam.11528
A. Chakraborty, S. Vardeman
Abstract: We propose tractable symmetric exponential families of distributions for multivariate vectors of 0's and 1's in p dimensions, or what are referred to in this paper as binary vectors, that allow for nontrivial amounts of variation around some central value μ ∈ {0,1}^p. We note that more or less standard asymptotics provides likelihood‐based inference in the one‐sample problem. We then consider mixture models where component distributions are of this form. Bayes analysis based on Dirichlet processes and Jeffreys priors for the exponential family parameters proves tractable and informative in problems where relevant distributions for a vector of binary variables are clearly not symmetric. We also extend our proposed Bayesian mixture model analysis to datasets with missing entries. Performance is illustrated through simulation studies and application to real datasets.
Citations: 0
Erratum to “Data‐driven dimension reduction in functional principal component analysis identifying the change‐point in functional data”
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-01 DOI: 10.1002/sam.11510
Abstract: In the article “Data-driven dimension reduction in functional principal component analysis identifying the change-point in functional data”, published in Statistical Analysis and Data Mining: The ASA Data Science Journal, Vol. 13, No. 6, p. 535, the following sentence was added to the Acknowledgements section after the first online publication: “The research of the third author Mr. Arjun Lakra is supported by a grant from Council of Scientific and Industrial Research (CSIR Award No.: 09/081(1350)/2019-EMR-I), Government of India.” We apologize for this error.
Citations: 0
A practical extension of the recursive multi‐fidelity model for the emulation of hole closure experiments
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-25 DOI: 10.1002/sam.11513
Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera
Abstract: In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred based on an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three‐dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where dense grids are slower but more accurate. To reduce the computational cost, simulations at different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi‐fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi‐fidelity model to employ low‐fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.
Citations: 1
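For orientation, multi-fidelity emulators of this kind are commonly built on an autoregressive link between resolution levels. The formula below is a sketch of that general idea under assumed notation (y_t for the simulator output at fidelity level t, \rho_{t-1} for a scale factor, \delta_t for a discrepancy term); the abstract does not give the paper's functional extension, so this should not be read as the authors' exact model.

```latex
% Autoregressive link between fidelity levels t-1 (coarser grid) and t (finer grid)
y_t(x) = \rho_{t-1}\, y_{t-1}(x) + \delta_t(x), \qquad t = 2, \dots, T

% \delta_t is typically modeled as a Gaussian process independent of y_{t-1};
% in the recursive variant, y_{t-1} is replaced by the fitted emulator of level t-1.
```

Cheap coarse-grid runs then carry most of the trend, and the few expensive fine-grid runs only need to inform the discrepancy \delta_t.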
Weighted pivot coordinates for partial least squares‐based marker discovery in high‐throughput compositional data
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-19 DOI: 10.1002/sam.11514
N. Štefelová, J. Palarea‐Albaladejo, K. Hron
Abstract: High‐throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well‐established statistical method to investigate associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows the processed signals to be characterized as compositional data, where the relevant information is contained in the pairwise log‐ratios between the components of the mixture. The (log‐ratio) pivot coordinate approach facilitates the aggregation into single variables of the pairwise log‐ratios of a component to all the remaining components. This simplifies interpretability and the investigation of their relative importance but, particularly in a high‐dimensional context, the aggregated log‐ratios can easily mix up information from different underlying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between response variable and pairwise log‐ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high‐throughput compositional data.
Citations: 6
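The aggregation of pairwise log-ratios described in this abstract can be checked numerically. The Python sketch below computes the standard (unweighted) first pivot coordinate of a composition in two equivalent ways: as a scaled log-ratio of the first part to the geometric mean of the rest, and as an equally weighted sum of pairwise log-ratios. The paper's proposal replaces the equal weights with correlation-based ones; those weights are not given in the abstract and are not reproduced here. Function names and the example composition are hypothetical.

```python
import math

def first_pivot_coordinate(x):
    """Unweighted first pivot coordinate of a composition x with positive parts."""
    d = len(x)
    log_geo_mean_rest = sum(math.log(v) for v in x[1:]) / (d - 1)
    return math.sqrt((d - 1) / d) * (math.log(x[0]) - log_geo_mean_rest)

def from_pairwise_log_ratios(x):
    """Same coordinate written as an equally weighted sum of pairwise log-ratios."""
    d = len(x)
    return sum(math.log(x[0] / v) for v in x[1:]) / math.sqrt(d * (d - 1))

composition = [0.2, 0.5, 0.3]  # hypothetical three-part composition
z1 = first_pivot_coordinate(composition)
```

Because only ratios enter the formula, rescaling the whole composition (for example, from proportions to raw intensities) leaves `z1` unchanged.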
Evaluating causal‐based feature selection for fuel property prediction models
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-11 DOI: 10.1002/sam.11511
Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson
Abstract: In‐silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench‐scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant, and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation‐based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal‐based feature selection might have on both model performance and identification of key molecular substructures. We found that causal‐based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.
Citations: 2
Markov chain to analyze web usability of a university website using eye tracking data
Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-10 DOI: 10.1002/sam.11512
Gianpaolo Zammarchi, L. Frigau, F. Mola
Abstract: Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow web usability to be measured more objectively than questionnaires do. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. Performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations and difficulty ratio. Transitions between different areas of interest (AOI) were analyzed in the two groups using a Markov chain. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results highlighted elements of the pages that can be modified to improve web usability.
Citations: 5
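The stationary distribution compared between the two groups in this abstract can be sketched in a few lines of Python. The example assumes a single recorded sequence of fixated areas of interest; the AOI names, the sequence, and the function names are hypothetical, and the power iteration used here is just one simple way to obtain the stationary distribution of an ergodic chain, not necessarily the authors' procedure.

```python
from collections import Counter

def transition_matrix(sequence, states):
    """Estimate row-stochastic transition probabilities from one AOI sequence."""
    counts = Counter(zip(sequence, sequence[1:]))
    matrix = []
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        # A state never left in the data gets a uniform row so rows still sum to 1.
        matrix.append([counts[(a, b)] / row_total if row_total else 1 / len(states)
                       for b in states])
    return matrix

def stationary_distribution(matrix, tol=1e-12, max_iter=10_000):
    """Power iteration: apply pi <- pi P until successive iterates agree within tol."""
    n = len(matrix)
    pi = [1 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical fixation sequence over three areas of interest.
states = ["menu", "search", "content"]
sequence = ["menu", "search", "content", "content", "menu", "search",
            "search", "content", "menu", "menu", "search"]
P = transition_matrix(sequence, states)
pi = stationary_distribution(P)
```

Each entry of `pi` is the long-run share of transitions that land on the corresponding AOI, so two groups with different `pi` vectors distribute their attention differently across the page.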