{"title":"Neural interval‐censored survival regression with feature selection","authors":"Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok","doi":"10.1002/sam.11704","DOIUrl":"https://doi.org/10.1002/sam.11704","url":null,"abstract":"Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high‐dimensional datasets, such as omics and medical image data. However, the literature on nonlinear regression algorithms and variable selection techniques for interval‐censoring is either limited or nonexistent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval‐censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: (i) a variable selection phase leveraging recent advances on sparse neural network architectures; (ii) a regression model targeting prediction of the interval‐censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real‐world applications that encompass scenarios related to diabetes and physical activity. 
Our results outperform traditional AFT algorithms, particularly in scenarios featuring nonlinear relationships.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"87 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141642725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian batch optimization for molybdenum versus tungsten inertial confinement fusion double shell target design","authors":"N. Vazirani, Ryan Sacks, Brian M. Haines, Michael J. Grosskopf, David J. Stark, Paul A. Bradley","doi":"10.1002/sam.11698","DOIUrl":"https://doi.org/10.1002/sam.11698","url":null,"abstract":"Access to reliable, clean energy sources is a major concern for national security. Much research is focused on the “grand challenge” of producing energy via controlled fusion reactions in a laboratory setting. For fusion experiments, specifically inertial confinement fusion (ICF), to produce sufficient energy, the fusion reactions in the ICF fuel need to become self‐sustaining and burn deuterium‐tritium (DT) fuel efficiently. The recent record‐breaking NIF ignition shot was able to achieve this goal as well as produce more energy than used to drive the experiment. This achievement brings self‐sustaining fusion‐based power systems closer than ever before, capable of providing humans with access to secure, renewable energy. In order to further progress toward the actualization of such power systems, more ICF experiments need to be conducted at large laser facilities such as the United States's National Ignition Facility (NIF) or France's Laser Mega‐Joule. The high cost per shot and limited number of shots that are possible per year make it prohibitive to perform large numbers of experiments. As such, experimental design relies heavily on complex predictive physics simulations for high‐fidelity “preshot” analysis. These multidimensional, multi‐physics, high‐fidelity simulations have to account for a variety of input parameters as well as modeling the extreme conditions (pressures and densities) present at ignition. Such simulations (especially in 3D) can become computationally prohibitive to turn around for each ICF experiment. 
In this work, we explore using Bayesian optimization with Gaussian processes (GPs) to find optimal designs for ICF double shell targets, while keeping computational costs to manageable levels. These double shell targets have an inner shell that grades from beryllium on the outer surface to the higher Z material molybdenum, as opposed to the nominally used tungsten, on the inside in order to trade off between the high performance associated with high density inner shells and capsule stability. We describe our results for “capsule‐only” xRAGE simulations to study the physics between different capsule designs, inner shell materials, and potential for future experiments.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141402295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gaussian process selections in semiparametric multi‐kernel machine regression for multi‐pathway analysis","authors":"Jiali Lin, Inyoung Kim","doi":"10.1002/sam.11699","DOIUrl":"https://doi.org/10.1002/sam.11699","url":null,"abstract":"Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers in which a gene pathway is a set of genes that functionally works together to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to a pathway as a set and a gene as an element in a set. However, it is challenging to select automatically which pathways are highly associated with the outcome when there are multiple pathways. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes) to address the problem of detecting signal sets associated with biomarkers. We model the unknown high‐dimensional functions of multi‐sets via multiple Gaussian kernel machines to consider the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge for structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. 
We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"60 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141409881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Confidence bounds for threshold similarity graph in random variable network","authors":"P. Koldanov, A. Koldanov, D. P. Semenov","doi":"10.1002/sam.11642","DOIUrl":"https://doi.org/10.1002/sam.11642","url":null,"abstract":"The problem of uncertainty in graph structure identification in a random variable network is considered. An approach for the construction of upper and lower confidence bounds for graph structures is developed. This approach is applied to the construction of upper and lower confidence bounds for the threshold similarity graph. The stability of the confidence bounds and the gaps between the upper and lower confidence bounds are investigated. Theoretical results are illustrated by numerical experiments.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"50 15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123519121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Improved D2GAN‐based oversampling algorithm for imbalanced data classification","authors":"Xiaoqiang Zhao, Qi Yao","doi":"10.1002/sam.11640","DOIUrl":"https://doi.org/10.1002/sam.11640","url":null,"abstract":"To address the problems of pattern collapse, uncontrollable data generation and high overlap rate when a generative adversarial network (GAN) oversamples imbalanced data, we propose an imbalanced data oversampling algorithm based on improved dual discriminator generative adversarial nets (D2GAN). First, we integrate the positive class attribute information into the generator and the discriminator to ensure that the generator only generates samples for the positive class, which overcomes the problem of uncontrollable data generation by the generator. Second, we introduce a classifier into D2GAN for discriminating the generated samples and the original data, which avoids the overlap between the generated samples and the negative class samples and ensures the diversity of the generated samples, so that the problem of pattern collapse is solved. Finally, the performance of the proposed algorithm is evaluated on 9 datasets by using SVM and neural network classification algorithms for oversampling experiments; the results show that the proposed algorithm effectively improves the classification performance of imbalanced data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123666575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A neutral zone classifier for three classes with an application to text mining","authors":"Dylan C. Friel, Yunzhe Li, Benjamin Ellis, D. Jeske, Herbert K. H. Lee, P. Kass","doi":"10.1002/sam.11639","DOIUrl":"https://doi.org/10.1002/sam.11639","url":null,"abstract":"A classifier may be limited by its conditional misclassification rates more than its overall misclassification rate. In the case that one or more of the conditional misclassification rates are high, a neutral zone may be introduced to decrease and possibly balance the misclassification rates. In this paper, a neutral zone is incorporated into a three‐class classifier with its region determined by controlling conditional misclassification rates. The neutral zone classifier is illustrated with a text mining application that classifies written comments associated with student evaluations of teaching.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126453612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble learning for score likelihood ratios under the common source problem","authors":"Federico Veneri, Danica M. Ommen","doi":"10.1002/sam.11637","DOIUrl":"https://doi.org/10.1002/sam.11637","url":null,"abstract":"Machine learning‐based score likelihood ratios (SLRs) have emerged as alternatives to traditional likelihood ratios and Bayes factors to quantify the value of evidence when contrasting two opposing propositions. When developing a conventional statistical model is infeasible, machine learning can be used to construct a (dis)similarity score for complex data and estimate the ratio of the conditional distributions of the scores. Under the common source problem, the opposing propositions address if two items come from the same source. To develop their SLRs, practitioners create datasets using pairwise comparisons from a background population sample. These comparisons result in a complex dependence structure that violates the independence assumption made by many popular methods. We propose a resampling step to remedy this lack of independence and an ensemble approach to enhance the performance of SLR systems. First, we introduce a source‐aware resampling plan to construct datasets where the independence assumption is met. Using these newly created sets, we train multiple base SLRs and aggregate their outputs into a final value of evidence. 
Our experimental results show that this ensemble SLR can outperform a traditional SLR approach in terms of the rate of misleading evidence and discriminatory power and present more consistent results.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125485310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLADAG 2021 special issue: Selected papers on classification and data analysis","authors":"C. Bocci, A. Gottard, T. B. Murphy, G. C. Porzio","doi":"10.1002/sam.11633","DOIUrl":"https://doi.org/10.1002/sam.11633","url":null,"abstract":"This special issue of Statistical Analysis and Data Mining contains a selection of the papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG), scheduled for September 9–11, 2021 in Florence, Italy. Due to the COVID-19 pandemic, the conference was held online. The CLADAG is a Section of the Italian Statistical Society (SIS), and a member of the International Federation of Classification Societies (IFCS). It was founded in 1997 to promote advanced methodological research in multivariate statistics, focusing on Data Analysis and Classification. The Section organizes a biennial international scientific meeting, offers classification and data analysis courses, publishes a newsletter, and collaborates on planning conferences and meetings with other IFCS societies. The previous 12 CLADAG meetings were held in various locations throughout Italy: Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), Milano (2017), and Cassino (2019). Following a blind peer-review process, six papers presented at the conference and submitted to this special issue have been selected for publication. The articles cover a broad range of data analysis topics: gender gap analysis, income clustering, structural equation modeling, multivariate nonparametric methods, and classifier selection. Their content is briefly described below. In studying the gender gap, a relevant topic for promoting equality and social justice, Greselin et al. propose a new parametric approach utilizing the relative distribution method and Dagum parametric inference. 
Additionally, they assessed how to select covariates that impact gender gaps. The proposed approach is applied to measure and compare the gender gap in Poland and Italy, using data from the 2018 European Survey of Income and Living Conditions. In a related field, Condino proposes a procedure for clustering income data using a share density-based dynamic clustering algorithm. The paper compares subgroups’ income inequality using a dissimilarity measure based on information theory. This measure is then utilized for clustering, providing a prototype descriptor of income inequality for the clustered earners. The proposal is applied to data from the Survey on Households Income and Wealth by the Bank of Italy. The paper by Yu et al. introduces a refinement of the so-called Henseler–Ogasawara specification that integrates composites, linear combinations of variables, into structural equation models. This refined version addresses some concerns of the Henseler–Ogasawara specification, and it is less complex and less prone to misspecification mistakes. Additionally, the paper provides a strategy to compute standard errors. Statistical depth functions are a valuable tool for multivariate nonparametric data analysis, extending the concept of ranks, orderings, and quantiles to the multivaria","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132268384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep learning factor analysis model based on importance‐weighted variational inference and normalizing flow priors: Evaluation within a set of multidimensional performance assessments in youth elite soccer players","authors":"P. Kilian, Daniel Leyhr, Christopher J. Urban, O. Höner, A. Kelava","doi":"10.1002/sam.11632","DOIUrl":"https://doi.org/10.1002/sam.11632","url":null,"abstract":"Exploratory factor analysis is a widely used framework in the social and behavioral sciences. Since measurement errors are always present in human behavior data, latent factors, generating the observed data, are important to identify. While most factor analysis methods rely on linear relationships in the data‐generating process, deep learning models can provide more flexible modeling approaches. However, two problems need to be addressed. First, for interpretation, scaling assumptions are required, which can be (at least) cumbersome for deep generative models. Second, deep generative models are typically not identifiable, which is required in order to identify the underlying latent constructs. We developed a model that uses a variational autoencoder as an estimator for a complex factor analysis model based on importance‐weighted variational inference. In order to receive interpretable results and an identified model, we use a linear factor model with identification constraints in the measurement model. To maintain the flexibility of the model, we use normalizing flow latent priors. 
Within the evaluation of performance measures in a talent development program in soccer, we found more clarity in the separation of the identified underlying latent dimensions with our models compared to traditional PCA analyses.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114061712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new parametric approach to gender gap with application to EUSILC data in Poland and Italy","authors":"F. Greselin, Alina Jȩdrzejczak, Kamila Trzcińska","doi":"10.1002/sam.11623","DOIUrl":"https://doi.org/10.1002/sam.11623","url":null,"abstract":"Real income distribution comparisons are of interest to policy makers across European countries. Nowadays, a crucial component of income inequality remains the discrepancy between men and women, often called the gender gap. Since the gender gap is related to the whole distribution of incomes in a population, popular single metrics are not adequate, and previous studies applied the relative distribution method, a non‐parametric approach to the comparison of distributions. Here, we propose a parametric approach for estimating the relative distribution. Then we extend it to assess the impact of selected covariates—related to the personal characteristics of the samples—on the existing gender gap in both countries. In more detail, models for income were fitted to empirical data from Poland and Italy, from the European Survey of Income and Living Conditions (wave 2018). Afterwards, their parameters were employed to obtain the estimates of relative distribution characteristics. The methods applied in the study turned out to be relevant to describe the gender gap over the entire income range. 
Finally, the results of the empirical analysis are discussed to reveal similarities and substantial differences between the countries.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}