{"title":"Integrative Learning of Structured High-Dimensional Data from Multiple Datasets.","authors":"Changgee Chang, Zongyu Dai, Jihwan Oh, Qi Long","doi":"10.1002/sam.11601","DOIUrl":"10.1002/sam.11601","url":null,"abstract":"<p><p>Integrative learning of multiple datasets has the potential to mitigate the challenge of small <i>n</i> and large <i>p</i> that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"16 2","pages":"120-134"},"PeriodicalIF":1.3,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10195070/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9511811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale affinities with missing data: Estimation and applications.","authors":"Min Zhang, Gal Mishne, Eric C Chi","doi":"10.1002/sam.11561","DOIUrl":"10.1002/sam.11561","url":null,"abstract":"<p><p>Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm. Nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method to construct row and column affinities even when data are missing by building off a co-clustering technique. This method takes advantage of solving the optimization problem for multiple pairs of cost parameters and filling in the missing values with increasingly smooth estimates. It exploits the coupled similarity structure among both the rows and columns of a data matrix. We show these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"15 3","pages":"303-313"},"PeriodicalIF":2.1,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9216212/pdf/nihms-1751214.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9560853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sample Selection Bias in Evaluation of Prediction Performance of Causal Models.","authors":"James P Long, Min Jin Ha","doi":"10.1002/sam.11559","DOIUrl":"https://doi.org/10.1002/sam.11559","url":null,"abstract":"<p><p>Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However prediction performance does depend on the selection of training and test sets. In particular biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren [5]. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association based estimators such as Lasso. Finally we compare the performance of causal estimators in simulation studies which reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"15 1","pages":"5-14"},"PeriodicalIF":1.3,"publicationDate":"2022-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9053600/pdf/nihms-1746637.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10589307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments.","authors":"Lu Zhang, Abhirup Datta, Sudipto Banerjee","doi":"10.1002/sam.11413","DOIUrl":"https://doi.org/10.1002/sam.11413","url":null,"abstract":"<p><p>With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles present in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial process that are practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages. Key insights are offered regarding assumptions and approximations concerning practical efficiency.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"12 3","pages":"197-209"},"PeriodicalIF":1.3,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11413","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10297504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}