Statistical Analysis and Data Mining: The ASA Data Science Journal, latest articles, page 9

CLADAG 2019 Special Issue: Selected Papers on Classification and Data Analysis

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-16 DOI: 10.1002/sam.11533

F. Greselin, T. B. Murphy, G. C. Porzio, D. Vistocco

Abstract: This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, 11–13 September 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation in Data Analysis and Classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting and schools related to classification and data analysis, publishes a newsletter, and cooperates with other member societies of the IFCS in the organization of their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically-based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). Best papers from the conference have been submitted to this special issue, and six of them have been selected for publication, following a blind peer-review process. The manuscripts deal with different data analysis issues: mixtures of distributions, compositional data analysis, Markov chains for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by Jiří Dvořák et al. (available in Stat Anal Data Min: The ASA Data Sci Journal. 2020;13:548–564) introduces the Clover plot, an easy-to-understand graphical tool that facilitates the appropriate choice of a classifier in supervised classification. It combines four complementary classifiers: the depth–depth plot, the bagdistance plot, an approach based on the illumination, and the classical diagnostic plot based on Mahalanobis distances. It borrows strengths from all these methodologies, contrasts them, and allows interpretations about the structure of the data. The paper by S.X. Lee et al. proposes a parallelization strategy for the Expectation–Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distributions such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM algorithm is suitable for single-threaded and multi-threaded processors as well as for single-machine and multiple-node systems. The EM algorithm is also discussed in the paper by L. Scrucca, which provides a fast and efficient Modal EM algorithm for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. Motiv…
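
The iterative mode-finding idea mentioned for Gaussian mixtures can be illustrated with a minimal sketch. It uses the generic Modal EM fixed-point update for a Gaussian mixture with known parameters and is not the implementation from the paper; the function names `gaussian_pdf` and `modal_em`, the tolerance, and the toy mixture are all assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at x."""
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def modal_em(x0, weights, means, covs, tol=1e-8, max_iter=500):
    """Climb from x0 to a local maximum of the mixture density (illustrative)."""
    x = np.asarray(x0, dtype=float)
    precisions = [np.linalg.inv(c) for c in covs]
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current point.
        dens = np.array([w * gaussian_pdf(x, m, c)
                         for w, m, c in zip(weights, means, covs)])
        post = dens / dens.sum()
        # M-step: maximize sum_k post_k * log N(x | mean_k, cov_k) over x,
        # which has a closed-form solution for Gaussian components.
        A = sum(p * S for p, S in zip(post, precisions))
        b = sum(p * S @ m for p, S, m in zip(post, precisions, means))
        x_new = np.linalg.solve(A, b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy mixture (hypothetical): two well-separated components in 2-D.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
mode_a = modal_em([0.5, -0.3], weights, means, covs)
mode_b = modal_em([4.0, 6.0], weights, means, covs)
print(mode_a, mode_b)
```

Each starting point converges to the mode of the nearest dominant component; running the iteration from every data point and merging coincident limits is one common way to enumerate the modes.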

Citations: 0

Application of the Cox proportional hazards model and competing risks models to critical illness insurance data

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-10 DOI: 10.1002/sam.11532

David Zapletal

Citations: 0

Cluster analysis via random partition distributions

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-05 DOI: 10.1002/sam.11602

D. B. Dahl, J. Andros, J. Carter

Citations: 3

Multi‐node Expectation–Maximization algorithm for finite mixture models

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-05 DOI: 10.1002/sam.11529

Sharon X. Lee, G. McLachlan, Kaleb L. Leemaqz

Citations: 0

Modeling and inference for mixtures of simple symmetric exponential families of p‐dimensional distributions for vectors with binary coordinates

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-03 DOI: 10.1002/sam.11528

A. Chakraborty, S. Vardeman

Citations: 0

Erratum to “Data‐driven dimension reduction in functional principal component analysis identifying the change‐point in functional data”

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-06-01 DOI: 10.1002/sam.11510

Citations: 0

A practical extension of the recursive multi‐fidelity model for the emulation of hole closure experiments

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-25 DOI: 10.1002/sam.11513

Amanda Muyskens, Kathleen L. Schmidt, Matthew D. Nelms, N. Barton, J. Florando, A. Kupresanin, David Rivera

Abstract: In regimes of high strain rate, the strength of materials often cannot be measured directly in experiments. Instead, the strength is inferred based on an experimental observable, such as a change in shape, that is matched by simulations supported by a known strength model. In hole closure experiments, the rate and degree to which a central hole in a plate of material closes during a dynamic loading event are used to infer material strength parameters. Due to the complexity of the experiment, many computationally expensive, three‐dimensional simulations are necessary to train an emulator for calibration or other analyses. These simulations can be run at multiple grid resolutions, where dense grids are slower but more accurate. To reduce the computational cost, simulations with different resolutions can be combined to develop an accurate emulator within a limited training time. We explore the novel design and construction of an appropriate functional recursive multi‐fidelity emulator of a strength model for tantalum in hole closure experiments that can be applied to arbitrarily large training data. Hence, by formulating a multi‐fidelity model to employ low‐fidelity simulations, we were able to reduce the error of our emulator by approximately 81% with only an approximately 1.6% increase in computing resource utilization.

Citations: 1

Weighted pivot coordinates for partial least squares‐based marker discovery in high‐throughput compositional data

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-19 DOI: 10.1002/sam.11514

N. Štefelová, J. Palarea‐Albaladejo, K. Hron

Abstract: High‐throughput data representing large mixtures of chemical or biological signals are ordinarily produced in the molecular sciences. Given a number of samples, partial least squares (PLS) regression is a well‐established statistical method to investigate associations between them and any continuous response variables of interest. However, technical artifacts generally make the raw signals not directly comparable between samples. Thus, data normalization is required before any meaningful scientific information can be drawn. This often allows the processed signals to be characterized as compositional data, where the relevant information is contained in the pairwise log‐ratios between the components of the mixture. The (log‐ratio) pivot coordinate approach facilitates the aggregation into single variables of the pairwise log‐ratios of a component to all the remaining components. This simplifies interpretability and the investigation of their relative importance but, particularly in a high‐dimensional context, the aggregated log‐ratios can easily mix up information from different underlying processes. In this context, we propose a weighting strategy for the construction of pivot coordinates for PLS regression which draws on the correlation between response variable and pairwise log‐ratios. Using real and simulated data sets, we demonstrate that this proposal enhances the discovery of biological markers in high‐throughput compositional data.
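
The aggregation of pairwise log‐ratios into a pivot coordinate can be made concrete with a short sketch. It shows only the standard, unweighted pivot coordinate of one component; the correlation-based weights proposed in the paper are not reproduced here. The function names and the toy data are assumptions for illustration.

```python
import numpy as np

def pairwise_logratios(x, k=0):
    """Log-ratios ln(x_k / x_j) of component k to every other component j."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x[..., [k]] - np.delete(log_x, k, axis=-1)

def pivot_coordinate(x, k=0):
    """Unweighted pivot coordinate of component k in a D-part composition:
    sqrt((D-1)/D) * ln(x_k / geometric mean of the remaining parts)."""
    log_x = np.log(np.asarray(x, dtype=float))
    D = log_x.shape[-1]
    others = np.delete(log_x, k, axis=-1)
    return np.sqrt((D - 1) / D) * (log_x[..., k] - others.mean(axis=-1))

# Toy compositions (hypothetical): rows are samples, columns are parts.
X = np.array([[0.2, 0.3, 0.5],
              [10.0, 20.0, 70.0]])
D = X.shape[1]
z = pivot_coordinate(X, k=0)
# The same quantity written as a scaled sum of pairwise log-ratios.
z_from_pairs = pairwise_logratios(X, k=0).sum(axis=-1) / np.sqrt(D * (D - 1))
print(z, z_from_pairs)
```

Because only ratios enter the formula, rescaling a sample (for example by a different total signal intensity) leaves its pivot coordinate unchanged. A weighted variant would replace the equal contribution of each pairwise log‐ratio with sample-independent weights.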

Citations: 6

Evaluating causal‐based feature selection for fuel property prediction models

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-11 DOI: 10.1002/sam.11511

Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson

Abstract: In‐silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench‐scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant, and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation‐based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal‐based feature selection might have on both model performance and identification of key molecular substructures. We found that causal‐based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.

Citations: 2

Markov chain to analyze web usability of a university website using eye tracking data

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2021-05-10 DOI: 10.1002/sam.11512

Gianpaolo Zammarchi, L. Frigau, F. Mola

Abstract: Web usability is a crucial feature of a website, allowing users to easily find information in a short time. Eye tracking data registered during the execution of tasks allow web usability to be measured more objectively than with questionnaires. In this work, we evaluated the web usability of the website of the University of Cagliari through the analysis of eye tracking data with qualitative and quantitative methods. Performances of two groups of students (i.e., high school and university students) across 10 different tasks were compared in terms of time to completion, number of fixations and difficulty ratio. Transitions between different areas of interest (AOI) were analyzed in the two groups using Markov chains. For the majority of tasks, we did not observe significant differences in the performances of the two groups, suggesting that the information needed to complete the tasks could easily be retrieved by students with little previous experience in using the website. For a specific task, high school students showed a worse performance based on the number of fixations and a different Markov chain stationary distribution compared to university students. These results highlighted elements of the pages that can be modified to improve web usability.
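
As a rough illustration of the Markov chain analysis described above, the sketch below estimates a transition matrix from sequences of visited AOIs and computes its stationary distribution. It is a generic first-order chain, not the authors' pipeline; the AOI coding, the sequences, and the uniform fallback for unvisited rows are assumptions.

```python
import numpy as np

def transition_matrix(sequences, n_states):
    """Row-normalized counts of transitions between consecutive AOIs."""
    counts = np.zeros((n_states, n_states))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    safe = np.where(row_sums == 0, 1.0, row_sums)
    P = counts / safe
    # Assumption: an AOI that is never left gets a uniform row.
    P[row_sums[:, 0] == 0] = 1.0 / n_states
    return P

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1.
    Assumes the chain is irreducible so that this distribution is unique."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

# Hypothetical fixation sequences over three AOIs coded 0, 1, 2.
sequences = [[0, 1, 2, 0, 1], [1, 2, 0, 2, 1]]
P = transition_matrix(sequences, n_states=3)
pi = stationary_distribution(P)
print(P)
print(pi)
```

Estimating one matrix per group (for example, high school versus university students) and comparing the resulting stationary distributions gives the long-run share of attention each AOI receives in each group.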

Citations: 5