Statistical Analysis and Data Mining: Latest Articles

Bayesian shrinkage models for integration and analysis of multiplatform high‐dimensional genomics data
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-04-06 DOI: 10.1002/sam.11682
Hao Xue, Sounak Chakraborty, Tanujit Dey
{"title":"Bayesian shrinkage models for integration and analysis of multiplatform high‐dimensional genomics data","authors":"Hao Xue, Sounak Chakraborty, Tanujit Dey","doi":"10.1002/sam.11682","DOIUrl":"https://doi.org/10.1002/sam.11682","url":null,"abstract":"With the increasing availability of biomedical data from multiple platforms of the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two‐stage hierarchical Bayesian model that integrates high‐dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use Expectation Maximization‐based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage. Then, we apply a group‐wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model‐based data integration method shows lower false positives in selecting predictive variables compared with existing method. Moreover, real data analysis based on a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method. 
Moreover, most of the selected biomarkers are crucial in GBM prognosis as confirmed by existing literature.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140603167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Expert‐in‐the‐loop design of integral nuclear data experiments
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-04-02 DOI: 10.1002/sam.11677
Isaac Michaud, Michael Grosskopf, Jesson Hutchinson, Scott Vander Wiel
{"title":"Expert‐in‐the‐loop design of integral nuclear data experiments","authors":"Isaac Michaud, Michael Grosskopf, Jesson Hutchinson, Scott Vander Wiel","doi":"10.1002/sam.11677","DOIUrl":"https://doi.org/10.1002/sam.11677","url":null,"abstract":"Nuclear data are fundamental inputs to radiation transport codes used for reactor design and criticality safety. The design of experiments to reduce nuclear data uncertainty has been a challenge for many years, but advances in the sensitivity calculations of radiation transport codes within the last two decades have made optimal experimental design possible. The design of integral nuclear experiments poses numerous challenges not emphasized in classical optimal design, in particular, constrained design spaces (in both a statistical and engineering sense), severely under‐determined systems, and optimality uncertainty. We present a design pipeline to optimize critical experiments that uses constrained Bayesian optimization within an iterative expert‐in‐the‐loop framework. We show a successfully completed experiment campaign designed with this framework that involved two critical configurations and multiple measurements that targeted compensating errors in ²³⁹Pu nuclear data.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140570932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hub‐aware random walk graph embedding methods for classification
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-04-01 DOI: 10.1002/sam.11676
Aleksandar Tomčić, Miloš Savić, Miloš Radovanović
{"title":"Hub‐aware random walk graph embedding methods for classification","authors":"Aleksandar Tomčić, Miloš Savić, Miloš Radovanović","doi":"10.1002/sam.11676","DOIUrl":"https://doi.org/10.1002/sam.11676","url":null,"abstract":"In the last two decades, we are witnessing a huge increase of valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data it is necessary to transform graphs into vector‐based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general‐purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualization and link prediction. In this article, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. Random walk sampling strategies of the proposed algorithms have been designed to pay special attention to hubs–high‐degree nodes that have the most critical role for the overall connectedness in large‐scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real‐world networks. 
The obtained results indicate that our methods considerably improve the predictive power of examined classifiers compared with currently the most popular random walk method for generating general‐purpose graph embeddings (node2vec).","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140570935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The finite mixture model for the tails of distribution: Monte Carlo experiment and empirical applications
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11671
Marilena Furno, Francesco Caracciolo
{"title":"The finite mixture model for the tails of distribution: Monte Carlo experiment and empirical applications","authors":"Marilena Furno, Francesco Caracciolo","doi":"10.1002/sam.11671","DOIUrl":"https://doi.org/10.1002/sam.11671","url":null,"abstract":"The finite mixture model estimates regression coefficients distinct in each of the different groups of the dataset endogenously determined by this estimator. In what follows the analysis is extended beyond the mean, estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces the overall heterogeneity, since the model is estimated for groups of similar observations, the analysis in the tails uncovers within groups heterogeneity and/or skewness. By integrating the endogenously determined clustering with the quantile regression analysis within each group, enhances the finite mixture models and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well‐known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each one with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within and between groups heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. By extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within groups heterogeneity and asymmetry otherwise left undetected. 
It improves efficiency and explanatory power with respect to the standard OLS‐based FMM.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Smart data augmentation: One equation is all you need
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11672
Yuhao Zhang, Lu Tang, Yuxiao Huang, Yan Ma
{"title":"Smart data augmentation: One equation is all you need","authors":"Yuhao Zhang, Lu Tang, Yuxiao Huang, Yan Ma","doi":"10.1002/sam.11672","DOIUrl":"https://doi.org/10.1002/sam.11672","url":null,"abstract":"Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no one single best approach for handling class imbalance that can be uniformly applied. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is an equation that can bring about an augmentation method that provides a unified representation of existing sampling methods for handling multi‐level class imbalance and allows easy fine‐tuning. This framework allows SDA to be seen as a generalization of traditional methods, which in turn can be viewed as specific cases of SDA. 
Empirical results on a wide range of datasets demonstrate that SDA could significantly improve the performance of the most popular classifiers such as random forest, multi‐layer perceptron, and histogram‐based gradient boosting.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Compositional variable selection in quantile regression for microbiome data with false discovery rate control
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11674
Runze Li, Jin Mu, Songshan Yang, Cong Ye, Xiang Zhan
{"title":"Compositional variable selection in quantile regression for microbiome data with false discovery rate control","authors":"Runze Li, Jin Mu, Songshan Yang, Cong Ye, Xiang Zhan","doi":"10.1002/sam.11674","DOIUrl":"https://doi.org/10.1002/sam.11674","url":null,"abstract":"Advancement in high‐throughput sequencing technologies has stimulated intensive research interests to identify specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both from the perspective of understanding biology and from the biomedical perspective of therapeutic development, as the microbiome is inherently modifiable. Despite availability of massive data, analysis of microbiome compositional data remains difficult. The nature that relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high‐dimensional settings, where a common research theme is to select a small fraction of signals from amid many noisy features. Motivated by studies examining the role of microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expressions. Besides accommodating compositional nature of microbiome data, our method both achieves FDR‐controlled variable selection, and captures heterogeneity due to either heteroscedastic variance or non‐location‐scale covariate effects displayed in the motivating dataset. 
We demonstrate the superior performance of our method using extensive numerical simulation studies and then apply it to real‐world microbiome data analysis to gain novel biological insights that are missed by traditional mean‐based linear regression analysis.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Non‐uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-27 DOI: 10.1002/sam.11675
Kevin R. Quinlan, Jagadeesh Movva, Brad Perfect
{"title":"Non‐uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases","authors":"Kevin R. Quinlan, Jagadeesh Movva, Brad Perfect","doi":"10.1002/sam.11675","DOIUrl":"https://doi.org/10.1002/sam.11675","url":null,"abstract":"The ability to non‐uniformly weight the input space is desirable for many applications, and has been explored for space‐filling approaches. Increased interests in linking models, such as in a digital twinning framework, increases the need for sampling emulators where they are most likely to be evaluated. In particular, we apply non‐uniform sampling methods for the construction of aerodynamic databases. This paper combines non‐uniform weighting with active learning for Gaussian Processes (GPs) to develop a closed‐form solution to a non‐uniform active learning criterion. We accomplish this by utilizing a kernel density estimator as the weight function. We demonstrate the need and efficacy of this approach with an atmospheric entry example that accounts for both model uncertainty as well as the practical state space of the vehicle, as determined by forward modeling within the active learning loop.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140316788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
eRPCA: Robust Principal Component Analysis for Exponential Family Distributions
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-27 DOI: 10.1002/sam.11670
Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie
{"title":"eRPCA: Robust Principal Component Analysis for Exponential Family Distributions","authors":"Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie","doi":"10.1002/sam.11670","DOIUrl":"https://doi.org/10.1002/sam.11670","url":null,"abstract":"Robust principal component analysis (RPCA) is a widely used method for recovering low‐rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low‐rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution for the data matrices, which in many applications are known and can be highly non‐Gaussian. We thus propose a new method called RPCA for exponential family distributions (eRPCA), which can perform the desired decomposition into low‐rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multiplier optimization algorithm for efficient decomposition, under either its natural or canonical parametrization. 
The effectiveness of eRPCA is then demonstrated in two applications: the first for steel sheet defect detection and the second for crime activity monitoring in the Atlanta metropolitan area.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-26 DOI: 10.1002/sam.11673
Raydonal Ospina, Ranah Duarte Costa, Leandro Chaves Rêgo, Fernando Marmolejo‐Ramos
{"title":"Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach","authors":"Raydonal Ospina, Ranah Duarte Costa, Leandro Chaves Rêgo, Fernando Marmolejo‐Ramos","doi":"10.1002/sam.11673","DOIUrl":"https://doi.org/10.1002/sam.11673","url":null,"abstract":"This work explores the use of nonparametric quantifiers in the signature verification problem of handwritten signatures. We used the MCYT‐100 (MCYT Fingerprint subcorpus) database, widely used in signature verification problems. The discrete‐time sequence positions in the x‐axis and y‐axis provided in the database are preprocessed, and time causal information based on nonparametric quantifiers such as entropy, complexity, Fisher information, and trend are employed. The study also proposes to evaluate these quantifiers with the time series obtained, applying the first and second derivatives of each sequence position to evaluate the dynamic behavior by looking at their velocity and acceleration regimes, respectively. The signatures in the MCYT‐100 database are classified via Logistic Regression, Support Vector Machines (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost). The quantifiers were used as input features to train the classifiers. To assess the ability and impact of nonparametric quantifiers to distinguish forgery and genuine signatures, we used variable selection criteria, such as: information gain, analysis of variance, and variance inflation factor. The performance of classifiers was evaluated by measures of classification error such as specificity and area under the curve. 
The results show that the SVM and XGBoost classifiers present the best performance.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Online learning for streaming data classification in nonstationary environments
IF 1.3, Zone 4 (Mathematics)
Statistical Analysis and Data Mining Pub Date : 2024-03-09 DOI: 10.1002/sam.11669
Yujie Gai, Kang Meng, Xiaodi Wang
{"title":"Online learning for streaming data classification in nonstationary environments","authors":"Yujie Gai, Kang Meng, Xiaodi Wang","doi":"10.1002/sam.11669","DOIUrl":"https://doi.org/10.1002/sam.11669","url":null,"abstract":"In this article, we implement the classification of nonstationary streaming data. Due to the inability to obtain full data in the context of streaming data, we adopt a strategy based on clustering structure for data classification. Specifically, this strategy involves dynamically maintaining clustering structures to update the model, thereby updating the objective function for classification. Simultaneously, incoming samples are monitored in real-time to identify the emergence of new classes or the presence of outliers. Moreover, this strategy can also deal with the concept drift problem, where the distribution of data changes with the inflow of data. Regarding the handling of novel instances, we introduce a buffer analysis mechanism to delay their processing, which in turn improves the prediction performance of the model. In the process of model updating, we also introduce a novel renewable strategy for the covariance matrix. Numerical simulations and experiments on datasets show that our method has significant advantages.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140070400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0