Statistical Analysis and Data Mining: The ASA Data Science Journal - Latest Articles

Neural interval‐censored survival regression with feature selection

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2024-07-16 DOI: 10.1002/sam.11704

Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok

Abstract: Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high‐dimensional datasets, such as omics and medical image data. However, the literature on nonlinear regression algorithms and variable selection techniques for interval‐censoring is either limited or nonexistent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval‐censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: (i) a variable selection phase leveraging recent advances on sparse neural network architectures; (ii) a regression model targeting prediction of the interval‐censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real‐world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring nonlinear relationships.
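
The interval‐censored AFT setting described in the abstract above can be illustrated with a minimal sketch. It assumes a log‐normal AFT model, log T = x·beta + sigma·eps with standard normal eps, which is a common textbook choice and not necessarily the authors' specification; the function names and toy data are hypothetical. Each observation contributes the probability that its event time falls in the observed interval (left, right].

```python
import math


def normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_censored_loglik(beta, sigma, X, intervals):
    """Log-likelihood of an assumed log-normal AFT model when each event
    time is only known to lie in the interval (left, right]."""
    total = 0.0
    for x, (left, right) in zip(X, intervals):
        mu = sum(b * xi for b, xi in zip(beta, x))  # linear predictor x.beta
        # right = inf encodes right censoring; left = 0 encodes left censoring
        upper = 1.0 if math.isinf(right) else normal_cdf((math.log(right) - mu) / sigma)
        lower = 0.0 if left <= 0.0 else normal_cdf((math.log(left) - mu) / sigma)
        # Guard against log(0) for numerically empty intervals
        total += math.log(max(upper - lower, 1e-300))
    return total


# Hypothetical toy data: an intercept plus one covariate per subject
X = [[1.0, 0.5], [1.0, -0.2], [1.0, 1.3]]
intervals = [(1.0, 3.0), (0.0, 2.0), (4.0, math.inf)]
print(interval_censored_loglik([0.5, 0.8], 1.0, X, intervals))
```

In a neural variant, the linear predictor `mu` would be replaced by a network output, and the same likelihood could serve as the training loss; that substitution is an assumption about how such a model might be built, not a description of the published method.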

Citations: 0

Bayesian batch optimization for molybdenum versus tungsten inertial confinement fusion double shell target design

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2024-06-01 DOI: 10.1002/sam.11698

N. Vazirani, Ryan Sacks, Brian M. Haines, Michael J. Grosskopf, David J. Stark, Paul A. Bradley

Abstract: Access to reliable, clean energy sources is a major concern for national security. Much research is focused on the “grand challenge” of producing energy via controlled fusion reactions in a laboratory setting. For fusion experiments, specifically inertial confinement fusion (ICF), to produce sufficient energy, the fusion reactions in the ICF fuel need to become self‐sustaining and burn deuterium‐tritium (DT) fuel efficiently. The recent record‐breaking NIF ignition shot was able to achieve this goal as well as produce more energy than used to drive the experiment. This achievement brings self‐sustaining fusion‐based power systems closer than ever before, capable of providing humans with access to secure, renewable energy. In order to further progress toward the actualization of such power systems, more ICF experiments need to be conducted at large laser facilities such as the United States's National Ignition Facility (NIF) or France's Laser Mega‐Joule. The high cost per shot and limited number of shots that are possible per year make it prohibitive to perform large numbers of experiments. As such, experimental design relies heavily on complex predictive physics simulations for high‐fidelity “preshot” analysis. These multidimensional, multi‐physics, high‐fidelity simulations have to account for a variety of input parameters as well as modeling the extreme conditions (pressures and densities) present at ignition. Such simulations (especially in 3D) can become computationally prohibitive to turn around for each ICF experiment. In this work, we explore using Bayesian optimization with Gaussian processes (GPs) to find optimal designs for ICF double shell targets, while keeping computational costs to manageable levels. These double shell targets have an inner shell that grades from beryllium on the outer surface to the higher Z material molybdenum, as opposed to the nominally used tungsten, on the inside in order to trade off between the high performance associated with high density inner shells and capsule stability. We describe our results for “capsule‐only” xRAGE simulations to study the physics between different capsule designs, inner shell materials, and potential for future experiments.

Citations: 0

Gaussian process selections in semiparametric multi‐kernel machine regression for multi‐pathway analysis

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2024-06-01 DOI: 10.1002/sam.11699

Jiali Lin, Inyoung Kim

Abstract: Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers in which a gene pathway is a set of genes that functionally works together to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to pathway as a set and gene as an element in a set. However, it is challenging to select automatically which pathways are highly associated to the outcome when there are multiple pathways. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes) to address a problem of detecting signal sets associated to biomarkers. We model the unknown high‐dimension functions of multi‐sets via multiple Gaussian kernel machines to consider the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge for structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.

Citations: 0

Confidence bounds for threshold similarity graph in random variable network

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-09-07 DOI: 10.1002/sam.11642

P. Koldanov, A. Koldanov, D. P. Semenov

Citations: 0

An Improved D2GAN‐based oversampling algorithm for imbalanced data classification

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-08-25 DOI: 10.1002/sam.11640

Xiaoqiang Zhao, Qi Yao

Citations: 0

A neutral zone classifier for three classes with an application to text mining

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-08-21 DOI: 10.1002/sam.11639

Dylan C. Friel, Yunzhe Li, Benjamin Ellis, D. Jeske, Herbert K. H. Lee, P. Kass

Citations: 0

Ensemble learning for score likelihood ratios under the common source problem

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-08-04 DOI: 10.1002/sam.11637

Federico Veneri, Danica M. Ommen

Abstract: Machine learning‐based score likelihood ratios (SLRs) have emerged as alternatives to traditional likelihood ratios and Bayes factors to quantify the value of evidence when contrasting two opposing propositions. When developing a conventional statistical model is infeasible, machine learning can be used to construct a (dis)similarity score for complex data and estimate the ratio of the conditional distributions of the scores. Under the common source problem, the opposing propositions address if two items come from the same source. To develop their SLRs, practitioners create datasets using pairwise comparisons from a background population sample. These comparisons result in a complex dependence structure that violates the independence assumption made by many popular methods. We propose a resampling step to remedy this lack of independence and an ensemble approach to enhance the performance of SLR systems. First, we introduce a source‐aware resampling plan to construct datasets where the independence assumption is met. Using these newly created sets, we train multiple base SLRs and aggregate their outputs into a final value of evidence. Our experimental results show that this ensemble SLR can outperform a traditional SLR approach in terms of the rate of misleading evidence and discriminatory power and present more consistent results.
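
The dependence problem mentioned in the abstract above arises because exhaustive pairwise comparisons reuse the same items and sources many times. The sketch below shows one possible way to draw comparisons so that no source appears in more than one pair. It is an illustrative assumption, not the authors' resampling plan: the function name, the one-in-three split between same-source and different-source pairs, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def source_aware_sample(items, rng):
    """Draw comparison pairs so that every source is used at most once.

    items: list of (source_id, value) tuples.
    Returns a list of (value_a, value_b, same_source) triples.
    """
    by_source = defaultdict(list)
    for source, value in items:
        by_source[source].append(value)

    sources = sorted(by_source)
    rng.shuffle(sources)

    pairs, leftover = [], []
    for i, source in enumerate(sources):
        values = by_source[source]
        # Hypothetical split: every third source yields a same-source pair
        if i % 3 == 0 and len(values) >= 2:
            a, b = rng.sample(values, 2)
            pairs.append((a, b, True))
        else:
            leftover.append(source)

    # Remaining sources are matched two by two into different-source pairs
    for s1, s2 in zip(leftover[0::2], leftover[1::2]):
        pairs.append((rng.choice(by_source[s1]), rng.choice(by_source[s2]), False))
    return pairs


# Toy background sample: 9 sources with 3 items each; values carry their source
items = [(s, (s, k)) for s in range(9) for k in range(3)]
pairs = source_aware_sample(items, random.Random(0))
print(len(pairs))
```

Repeating the draw with different seeds would give several such datasets, each of which could train one base SLR whose outputs are then combined; the combination rule is left open here because the abstract does not specify it.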

Citations: 1

CLADAG 2021 special issue: Selected papers on classification and data analysis

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-07-04 DOI: 10.1002/sam.11633

C. Bocci, A. Gottard, T. B. Murphy, G. C. Porzio

Abstract: This special issue of Statistical Analysis and Data Mining contains a selection of the papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG), scheduled for September 9–11, 2021 in Florence, Italy. Due to the COVID-19 pandemic, the conference was held online. The CLADAG is a Section of the Italian Statistical Society (SIS), and a member of the International Federation of Classification Societies (IFCS). It was founded in 1997 to promote advanced methodological research in multivariate statistics, focusing on Data Analysis and Classification. The Section organizes a biennial international scientific meeting, offers classification and data analysis courses, publishes a newsletter, and collaborates on planning conferences and meetings with other IFCS societies. The previous 12 CLADAG meetings were held in various locations throughout Italy: Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), Milano (2017), and Cassino (2019). Following a blind peer-review process, six papers presented at the conference and submitted to this special issue have been selected for publication. The articles cover a broad range of data analysis topics: gender gap analysis, income clustering, structural equation modeling, multivariate nonparametric methods, and classifier selection. Their content is briefly described below. In studying the gender gap, a relevant topic for promoting equality and social justice, Greselin et al. propose a new parametric approach utilizing the relative distribution method and Dagum parametric inference. Additionally, they assessed how to select covariates that impact gender gaps. The proposed approach is applied to measure and compare the gender gap in Poland and Italy, using data from the 2018 European Survey of Income and Living Conditions. On a related field, Condino proposes a procedure for clustering income data using a share density-based dynamic clustering algorithm. The paper compares subgroups’ income inequality using a dissimilarity measure based on information theory. This measure is then utilized for clustering, providing a prototype descriptor of income inequality for the clustered earners. The proposal is applied to data from the Survey on Households Income and Wealth by the Bank of Italy. The paper by Yu et al. introduces a refinement of the so-called Henseler–Ogasawara specification that integrates composites, linear combinations of variables, into structural equation models. This refined version addresses some concerns of the Henseler–Ogasawara specification, and it is less complex and less prone to misspecification mistakes. Additionally, the paper provides a strategy to compute standard errors. Statistical depth functions are a valuable tool for multivariate nonparametric data analysis, extending the concept of ranks, orderings, and quantiles to the multivaria

Citations: 0

A deep learning factor analysis model based on importance‐weighted variational inference and normalizing flow priors: Evaluation within a set of multidimensional performance assessments in youth elite soccer players

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-06-30 DOI: 10.1002/sam.11632

P. Kilian, Daniel Leyhr, Christopher J. Urban, O. Höner, A. Kelava

Abstract: Exploratory factor analysis is a widely used framework in the social and behavioral sciences. Since measurement errors are always present in human behavior data, latent factors, generating the observed data, are important to identify. While most factor analysis methods rely on linear relationships in the data‐generating process, deep learning models can provide more flexible modeling approaches. However, two problems need to be addressed. First, for interpretation, scaling assumptions are required, which can be (at least) cumbersome for deep generative models. Second, deep generative models are typically not identifiable, which is required in order to identify the underlying latent constructs. We developed a model that uses a variational autoencoder as an estimator for a complex factor analysis model based on importance‐weighted variational inference. In order to receive interpretable results and an identified model, we use a linear factor model with identification constraints in the measurement model. To maintain the flexibility of the model, we use normalizing flow latent priors. Within the evaluation of performance measures in a talent development program in soccer, we found more clarity in the separation of the identified underlying latent dimensions with our models compared to traditional PCA analyses.

Citations: 0

A new parametric approach to gender gap with application to EUSILC data in Poland and Italy

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2023-05-15 DOI: 10.1002/sam.11623

F. Greselin, Alina Jȩdrzejczak, Kamila Trzcińska

Abstract: Real income distribution comparisons are of interest to policy makers across European countries. Nowadays, a crucial component of income inequality remains the discrepancy between men and women, often called the gender gap. Since the gender gap is related to the whole distribution of incomes in a population, popular single metrics are not adequate, and previous studies applied the relative distribution method, a non‐parametric approach to the comparison of distributions. Here, we propose a parametric approach for estimating the relative distribution. Then we extend it to assess the impact of selected covariates, related to the personal characteristics of the samples, on the existing gender gap in both countries. In more detail, models for income were fitted to empirical data from Poland and Italy, from the European Survey of Income and Living Conditions (wave 2018). Afterwards, their parameters were employed to obtain the estimates of relative distribution characteristics. The methods applied in the study turned out to be relevant to describe the gender gap over the entire income range. Finally, the results of the empirical analysis are discussed to reveal similarities and substantial differences between the countries.
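
The parametric relative distribution mentioned in the abstract above can be sketched as follows. The sketch assumes that both income distributions follow a Dagum law with CDF F(x) = (1 + (x/b)^(-a))^(-p), in line with the Dagum inference mentioned in the special issue editorial listed earlier on this page; the parameter values and function names are hypothetical and are not estimates from the paper. The relative density at rank r is the ratio of the comparison density to the reference density, both evaluated at the r-th quantile of the reference distribution.

```python
def dagum_cdf(x, a, b, p):
    """Dagum CDF with shape parameters a, p and scale b."""
    return (1.0 + (x / b) ** (-a)) ** (-p)


def dagum_pdf(x, a, b, p):
    """Dagum density, the derivative of dagum_cdf with respect to x."""
    t = (x / b) ** a
    return (a * p / x) * t ** p / (t + 1.0) ** (p + 1.0)


def dagum_quantile(u, a, b, p):
    """Inverse of dagum_cdf for 0 < u < 1."""
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)


def relative_density(r, comparison, reference):
    """Relative density g(r) = f(Q0(r)) / f0(Q0(r)), where Q0 is the
    quantile function of the reference distribution."""
    x = dagum_quantile(r, *reference)
    return dagum_pdf(x, *comparison) / dagum_pdf(x, *reference)


# Hypothetical (a, b, p) parameters for the two groups
reference = (3.0, 2.0, 0.8)
comparison = (3.2, 1.7, 0.8)
for r in (0.1, 0.5, 0.9):
    print(r, relative_density(r, comparison, reference))
```

Values of g(r) above 1 indicate that the comparison group is over‐represented at that rank of the reference distribution, and values below 1 indicate under‐representation; g(r) equals 1 everywhere when the two distributions coincide.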

Citations: 0