Statistical Analysis and Data Mining: Latest Articles, Page 3

Bayesian shrinkage models for integration and analysis of multiplatform high‐dimensional genomics data

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-04-06 DOI: 10.1002/sam.11682

Hao Xue, Sounak Chakraborty, Tanujit Dey

Abstract: With the increasing availability of biomedical data from multiple platforms for the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two‐stage hierarchical Bayesian model that integrates high‐dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use an Expectation‐Maximization‐based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage. Then, we apply a group‐wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model‐based data integration method yields fewer false positives in selecting predictive variables than the existing method. Moreover, real data analysis based on a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method. Most of the selected biomarkers are crucial in GBM prognosis, as confirmed by existing literature.

Citations: 0

Expert‐in‐the‐loop design of integral nuclear data experiments

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-04-02 DOI: 10.1002/sam.11677

Isaac Michaud, Michael Grosskopf, Jesson Hutchinson, Scott Vander Wiel

Citations: 0

Hub‐aware random walk graph embedding methods for classification

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-04-01 DOI: 10.1002/sam.11676

Aleksandar Tomčić, Miloš Savić, Miloš Radovanović

Abstract: Over the last two decades, we have witnessed a huge increase in valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data, it is necessary to transform graphs into vector‐based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general‐purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualization, and link prediction. In this article, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. The random walk sampling strategies of the proposed algorithms have been designed to pay special attention to hubs, that is, high‐degree nodes that play the most critical role in the overall connectedness of large‐scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real‐world networks. The obtained results indicate that our methods considerably improve the predictive power of the examined classifiers compared with node2vec, currently the most popular random walk method for generating general‐purpose graph embeddings.
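To make the idea of a hub‐sensitive sampling strategy concrete, the following is a minimal sketch of a degree‐biased random walk. It is not the authors' algorithm: the function name `hub_biased_walk`, the `alpha` exponent, and the toy graph are all assumptions made for illustration, and the sketch assumes an undirected graph stored as an adjacency dictionary.

```python
import random

def hub_biased_walk(adj, start, length, alpha=1.0, rng=None):
    """Random walk in which the next node is drawn with probability
    proportional to degree(neighbor) ** alpha.
    alpha = 0 gives a uniform walk; alpha > 0 favors hubs (hypothetical knob)."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # isolated node: stop early
            break
        # In an undirected graph every neighbor has degree >= 1,
        # so the weights are strictly positive.
        weights = [len(adj[v]) ** alpha for v in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Toy undirected graph in which node 0 is a hub.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
walk = hub_biased_walk(adj, start=1, length=10, alpha=1.0, rng=random.Random(0))
```

Walks sampled this way could be fed to a skip‐gram model, as in node2vec‐style pipelines; how the paper actually weights hubs is not stated in the abstract.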

Citations: 0

The finite mixture model for the tails of distribution: Monte Carlo experiment and empirical applications

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11671

Marilena Furno, Francesco Caracciolo

Abstract: The finite mixture model estimates regression coefficients that are distinct in each of the different groups of the dataset, with the groups endogenously determined by the estimator. In what follows, the analysis is extended beyond the mean by estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces the overall heterogeneity, since the model is estimated for groups of similar observations, the analysis in the tails uncovers within‐group heterogeneity and/or skewness. Integrating the endogenously determined clustering with quantile regression analysis within each group enhances finite mixture models and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well‐known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each one with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within‐group and between‐group heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. Extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within‐group heterogeneity and asymmetry otherwise left undetected. It improves efficiency and explanatory power with respect to the standard OLS‐based FMM.
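Estimation "in the tails" relies on quantile regression, whose building block is the check (pinball) loss. The sketch below is only an illustration of that loss for an intercept‐only model, not the paper's finite mixture estimator; the data and the grid search are assumptions chosen to keep the example short.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Average check loss of the constant prediction q at quantile level tau."""
    u = np.asarray(y, dtype=float) - q
    return float(np.mean(np.maximum(tau * u, (tau - 1.0) * u)))

# Toy sample with one large observation in the upper tail.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 100.0, 10001)

# Minimizing the loss over a grid recovers the corresponding sample quantile.
best_median = grid[np.argmin([pinball_loss(y, q, 0.5) for q in grid])]
best_upper = grid[np.argmin([pinball_loss(y, q, 0.9) for q in grid])]
```

Here the minimizer at tau = 0.5 sits at the sample median, while the minimizer at tau = 0.9 moves to the extreme observation, which is the kind of tail asymmetry a mean‐based fit would average away.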

Citations: 0

Smart data augmentation: One equation is all you need

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11672

Yuhao Zhang, Lu Tang, Yuxiao Huang, Yan Ma

Abstract: Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach for handling class imbalance that can be uniformly applied. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is an equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi‐level class imbalance and allowing easy fine‐tuning. This framework allows SDA to be seen as a generalization of traditional methods, which in turn can be viewed as specific cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of the most popular classifiers, such as random forest, multi‐layer perceptron, and histogram‐based gradient boosting.
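The abstract does not give the SDA equation, so it is not reproduced here. As a point of reference, the sketch below shows plain random oversampling, one of the traditional sampling methods that such a framework would presumably cover as a special case; the function name `random_oversample` and the toy data are assumptions for illustration only.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Resample every class with replacement up to the size of the largest class."""
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # Draw the missing rows (possibly zero of them) from the same class.
        extra = rng.choice(members, size=target - n, replace=True)
        keep.append(np.concatenate([members, extra]))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data: four samples of class 0, two of class 1.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X, y, np.random.default_rng(0))
```

After resampling, both classes have four rows; a tunable scheme would replace the fixed "match the largest class" target with per‐class targets.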

Citations: 0

Compositional variable selection in quantile regression for microbiome data with false discovery rate control

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-28 DOI: 10.1002/sam.11674

Runze Li, Jin Mu, Songshan Yang, Cong Ye, Xiang Zhan

Abstract: Advances in high‐throughput sequencing technologies have stimulated intensive research interest in identifying specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both from the perspective of understanding biology and from the biomedical perspective of therapeutic development, as the microbiome is inherently modifiable. Despite the availability of massive data, analysis of microbiome compositional data remains difficult. The fact that the relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high‐dimensional settings, where a common research theme is to select a small fraction of signals from amid many noisy features. Motivated by studies examining the role of the microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expression. Besides accommodating the compositional nature of microbiome data, our method both achieves FDR‐controlled variable selection and captures heterogeneity due to either heteroscedastic variance or non‐location‐scale covariate effects displayed in the motivating dataset. We demonstrate the superior performance of our method using extensive numerical simulation studies and then apply it to real‐world microbiome data analysis to gain novel biological insights that are missed by traditional mean‐based linear regression analysis.
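The sum‐to‐one constraint mentioned above is often handled with a log‐ratio transform before regression. The sketch below shows the centered log‐ratio (CLR) transform as one common choice; whether the paper uses CLR, a log‐contrast model, or something else is not stated in the abstract, and the `pseudocount` argument is an assumption added to guard against zero abundances.

```python
import numpy as np

def clr(X, pseudocount=1e-6):
    """Centered log-ratio transform of compositional rows.
    Each row is renormalized to sum to one, logged, and centered by its mean log."""
    X = np.asarray(X, dtype=float) + pseudocount
    X = X / X.sum(axis=1, keepdims=True)
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Two toy samples with three taxa each (relative abundances).
abundances = np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.1, 0.8]])
Z = clr(abundances)
```

Each transformed row sums to zero, and the result does not depend on the overall scale of the input, which is why raw counts and relative abundances give the same CLR values when no pseudocount is added.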

Citations: 0

Non‐uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-27 DOI: 10.1002/sam.11675

Kevin R. Quinlan, Jagadeesh Movva, Brad Perfect

Citations: 0

eRPCA: Robust Principal Component Analysis for Exponential Family Distributions

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-27 DOI: 10.1002/sam.11670

Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

Abstract: Robust principal component analysis (RPCA) is a widely used method for recovering low‐rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes of anomalies, and the joint identification of such corruptions with the low‐rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution of the data matrices, which in many applications is known and can be highly non‐Gaussian. We thus propose a new method called RPCA for exponential family distributions (eRPCA), which can perform the desired decomposition into low‐rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multipliers (ADMM) optimization algorithm for efficient decomposition, under either its natural or canonical parametrization. The effectiveness of eRPCA is then demonstrated in two applications: the first for steel sheet defect detection and the second for crime activity monitoring in the Atlanta metropolitan area.
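For orientation, the sketch below implements the classical RPCA baseline (principal component pursuit solved with ADMM), not eRPCA itself: it ignores the exponential‐family likelihood that the paper adds. The default values of `lam` and `mu` follow a commonly used heuristic and are assumptions here, as are the function names and the synthetic test matrix.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage: proximal operator of tau times the L1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value shrinkage: proximal operator of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def rpca_admm(M, n_iter=1000, tol=1e-7):
    """Split M into low-rank L plus sparse S by minimizing
    ||L||_* + lam * ||S||_1 subject to L + S = M."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # heuristic weight (assumption)
    mu = m * n / (4.0 * np.abs(M).sum())    # heuristic penalty (assumption)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # dual variable
    for _ in range(n_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        R = M - L - S                       # primal residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic example: rank-one background plus a few large spikes.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.normal(size=20), rng.normal(size=20))
sparse = np.zeros((20, 20))
sparse[rng.integers(0, 20, size=8), rng.integers(0, 20, size=8)] = 10.0
M = low_rank + sparse
L_hat, S_hat = rpca_admm(M)
rel_residual = np.linalg.norm(M - L_hat - S_hat) / np.linalg.norm(M)
```

The two proximal steps alternate between the low‐rank and sparse parts; an exponential‐family variant would change the data‐fit term linking `M` to `L + S`, which this Gaussian‐style equality constraint does not capture.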

Citations: 0

Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-26 DOI: 10.1002/sam.11673

Raydonal Ospina, Ranah Duarte Costa, Leandro Chaves Rêgo, Fernando Marmolejo‐Ramos

Abstract: This work explores the use of nonparametric quantifiers in the verification of handwritten signatures. We used the MCYT‐100 (MCYT Fingerprint subcorpus) database, widely used in signature verification problems. The discrete‐time sequence positions in the x‐axis and y‐axis provided in the database are preprocessed, and time causal information based on nonparametric quantifiers such as entropy, complexity, Fisher information, and trend is employed. The study also proposes to evaluate these quantifiers on the time series obtained by applying the first and second derivatives of each sequence position, in order to evaluate the dynamic behavior by looking at their velocity and acceleration regimes, respectively. The signatures in the MCYT‐100 database are classified via Logistic Regression, Support Vector Machines (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost). The quantifiers were used as input features to train the classifiers. To assess the ability and impact of nonparametric quantifiers to distinguish forged from genuine signatures, we used variable selection criteria such as information gain, analysis of variance, and variance inflation factor. The performance of the classifiers was evaluated by measures of classification error such as specificity and area under the curve. The results show that the SVM and XGBoost classifiers present the best performance.
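The velocity and acceleration series described above can be approximated with finite differences. The sketch below assumes uniformly sampled pen positions and uses simple first and second differences; the paper's actual preprocessing is not specified in the abstract, and the function name `kinematics` and the toy trajectory are illustrative assumptions.

```python
import numpy as np

def kinematics(position):
    """First and second differences of a uniformly sampled position sequence,
    used here as stand-ins for velocity and acceleration."""
    position = np.asarray(position, dtype=float)
    velocity = np.diff(position, n=1)
    acceleration = np.diff(position, n=2)
    return velocity, acceleration

# Toy x-coordinate trajectory following t**2 for t = 0..5.
x = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
vx, ax = kinematics(x)
```

For this quadratic trajectory the first differences grow linearly and the second differences are constant; quantifiers such as entropy would then be computed on `vx` and `ax` in the same way as on the raw positions.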

Citations: 0

Online learning for streaming data classification in nonstationary environments

IF 1.3, Zone 4 (Mathematics)

Statistical Analysis and Data Mining Pub Date : 2024-03-09 DOI: 10.1002/sam.11669

Yujie Gai, Kang Meng, Xiaodi Wang

Citations: 0