Statistical Applications in Genetics and Molecular Biology: Latest Articles (Page 5)

A time warping approach to multiple sequence alignment.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI: 10.1515/sagmb-2016-0043

Ana Arribas-Gil, Catherine Matias

Citations: 1

No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI: 10.1515/sagmb-2017-0010

Aaron T L Lun, Gordon K Smyth

Abstract: RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.

Volume 16, Issue 2, Pages 83-93

Citations: 15
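To make the degrees-of-freedom point in the Lun and Smyth abstract concrete, the following Python sketch counts residual d.f. for a single gene in a one-way layout. The adjustment shown (a group whose counts are all zero contributes no residual d.f.) is an illustrative assumption for this simple design, not the formula from the article and not the edgeR implementation. The function name `residual_df` and the example counts are hypothetical.

```python
from typing import Dict, List


def residual_df(counts_by_group: Dict[str, List[int]]) -> Dict[str, int]:
    """Compare the standard residual d.f. with a reduced version.

    Assumes a one-way layout with one coefficient per group.
    A group whose counts are all zero has a fitted value of exactly
    zero, so in this sketch it is treated as carrying no information
    about variability.
    """
    n_obs = sum(len(counts) for counts in counts_by_group.values())
    n_coef = len(counts_by_group)

    # Standard formula: observations minus fitted coefficients.
    standard = n_obs - n_coef

    # Reduced version: only groups with at least one nonzero count
    # contribute (group size - 1) residual d.f.
    reduced = sum(
        len(counts) - 1
        for counts in counts_by_group.values()
        if any(c > 0 for c in counts)
    )
    return {"standard": standard, "reduced": reduced}


# Hypothetical gene: silent in the control group, expressed when treated.
example = residual_df({"control": [0, 0, 0], "treated": [5, 8, 3]})
print(example)
```

In this toy case the standard count credits the all-zero control group with residual d.f. that the reduced count withholds, which is the kind of overstatement the abstract describes.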

A Bayesian semiparametric factor analysis model for subtype identification.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI: 10.1515/sagmb-2016-0051

Jiehuan Sun, Joshua L Warren, Hongyu Zhao

Abstract: Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering due to the high dimensionality. In this article, we introduce a novel subtype identification method in the Bayesian setting based on gene expression profiles. This method, called BCSub, adopts an innovative semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow the Dirichlet process mixture model in order to induce clustering. Through extensive simulation studies, we show that BCSub has improved performance over commonly used clustering methods. When applied to two gene expression datasets, our model is able to identify subtypes that are clinically more relevant than those identified from the existing methods.

Volume 16, Issue 2, Pages 145-158

Citations: 0

Missing value imputation for gene expression data by tailored nearest neighbors.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI: 10.1515/sagmb-2015-0098

Shahla Faisal, Gerhard Tutz

Abstract: High dimensional data like gene expression and RNA-sequences often contain missing values. The subsequent analysis and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputation of missing values in gene expression data have been developed but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed for genes that are apt to contribute to the accuracy of imputed values. The method aims at avoiding the curse of dimensionality, which typically occurs if local methods such as nearest neighbors are applied in high dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques like mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values for high dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.

Volume 16, Issue 2, Pages 95-106

Citations: 17
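The idea in the Faisal and Tutz abstract, a nearest-neighbor distance computed over a chosen subset of genes and used to weight donor samples, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: the Gaussian kernel, the `bandwidth` parameter, the function name `impute_weighted_nn`, and the convention that the caller supplies `selected_cols` are all hypothetical choices made for the example. How the article selects the contributing genes is not reproduced here.

```python
import math
from typing import List, Optional, Sequence

Matrix = List[List[Optional[float]]]  # rows = samples, columns = genes


def impute_weighted_nn(
    data: Matrix,
    row: int,
    col: int,
    selected_cols: Sequence[int],
    bandwidth: float = 1.0,
) -> Optional[float]:
    """Impute data[row][col] as a kernel-weighted mean of donor samples.

    The distance between the target sample and each donor uses only
    the genes in `selected_cols` (assumed to be chosen beforehand),
    restricted to genes observed in both samples.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i, donor in enumerate(data):
        if i == row or donor[col] is None:
            continue  # a donor must have the target gene observed
        shared = [
            c for c in selected_cols
            if data[row][c] is not None and donor[c] is not None
        ]
        if not shared:
            continue
        # Root mean squared difference over the shared selected genes.
        dist = math.sqrt(
            sum((data[row][c] - donor[c]) ** 2 for c in shared) / len(shared)
        )
        weight = math.exp(-((dist / bandwidth) ** 2))  # Gaussian kernel (assumption)
        weighted_sum += weight * donor[col]
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical 3-sample, 3-gene matrix with one missing entry.
toy = [[1.0, 2.0, None], [1.0, 2.0, 5.0], [1.0, 2.0, 7.0]]
print(impute_weighted_nn(toy, row=0, col=2, selected_cols=[0, 1]))
```

Restricting the distance to `selected_cols` is what keeps uninformative genes from dominating the neighbor weights; with all genes included, the sketch reduces to an ordinary kernel-weighted nearest-neighbor imputation.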

Robin Hood: A cost-efficient two-stage approach to large-scale simultaneous inference with non-homogeneous sparse effects.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-04-25 DOI: 10.1515/sagmb-2016-0039

Jakub Pecanka, Jelle Goeman

Abstract: A classical approach to experimental design in many scientific fields is to first gather all of the data and then analyze it in a single analysis. It has been recognized that in many areas such practice leaves substantial room for improvement in terms of the researcher's ability to identify relevant effects, in terms of cost efficiency, or both. Considerable attention has been paid in recent years to multi-stage designs, in which the user alternates between data collection and analysis and thereby sequentially reduces the size of the problem. However, the focus has generally been towards designs that require a hypothesis be tested in every single stage before it can be declared as rejected by the procedure. Such procedures are well-suited for homogeneous effects, i.e. effects of (almost) equal sizes; however, with effects of varying size a procedure that permits rejection at interim stages is much more suitable. Here we present precisely such a multi-stage testing procedure, called Robin Hood. We show that with heterogeneous effects our method substantially improves on the existing multi-stage procedures with an essentially zero efficiency trade-off in the homogeneous effect realm, which makes it especially useful in areas such as genetics, where heterogeneous effects are common. Our method improves on existing approaches in a number of ways including a novel way of performing two-sided testing in a multi-stage procedure with increased power for detecting small effects.

Volume 16, Issue 2, Pages 107-132

Citations: 1

Bivariate Poisson models with varying offsets: an application to the paired mitochondrial DNA dataset.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-03-01 DOI: 10.1515/sagmb-2016-0040

Pei-Fang Su, Yu-Lin Mau, Yan Guo, Chung-I Li, Qi Liu, John D Boice, Yu Shyr

Abstract: To assess the effect of chemotherapy on mitochondrial genome mutations in cancer survivors and their offspring, a study sequenced the full mitochondrial genome and determined the mitochondrial DNA (mtDNA) heteroplasmic mutation rate. To build a model for counts of heteroplasmic mutations in mothers and their offspring, bivariate Poisson regression was used to examine the relationship between mutation count and clinical information while accounting for the paired correlation. However, if the sequencing depth is not adequate, a limited fraction of the mtDNA will be available for variant calling. The classical bivariate Poisson regression model treats the offset term as equal within pairs; thus, it cannot be applied directly. In this research, we propose an extended bivariate Poisson regression model that has a more general offset term to adjust the length of the accessible genome for each observation. We evaluate the performance of the proposed method with comprehensive simulations, and the results show that the regression model provides unbiased parameter estimations. The use of the model is also demonstrated using the paired mtDNA dataset.

Volume 16, Issue 1, Pages 47-58

Citations: 1

Comparison and visualisation of agreement for paired lists of rankings.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-03-01 DOI: 10.1515/sagmb-2016-0036

Margaret R Donald, Susan R Wilson

Abstract: Output from analysis of a high-throughput 'omics' experiment very often is a ranked list. One commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, with a length of many hundreds of genes. There are numerous situations where interest is in the comparison of outputs following, say, two (or more) different experiments, or of different approaches to the analysis that produce different ranked lists. Rather than considering exact agreement between the rankings, following others, we consider two ranked lists to be in agreement if the rankings differ by some fixed distance. Generally only a relatively small subset of the k top-ranked items will be in agreement. So the aim is to find the point k at which the probability of agreement in rankings changes from being greater than 0.5 to being less than 0.5. We use penalized splines and a Bayesian logit model to give a nonparametric smooth to the sequence of agreements, as well as pointwise credible intervals for the probability of agreement. Our approach produces a point estimate and a credible interval for k. R code is provided. The method is applied to rankings of genes from breast cancer microarray experiments.

Volume 16, Issue 1, Pages 31-45

Citations: 0
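The agreement sequence described in the Donald and Wilson abstract can be illustrated with a short Python sketch. It marks an item as "in agreement" when its positions in the two lists differ by at most a fixed distance, then looks for the point where agreement falls below 0.5. The smoothing here is a plain moving average, swapped in for simplicity; it is not the penalized-spline Bayesian logit model of the article and yields no credible interval. The function names, the `window` parameter, and the toy lists are hypothetical.

```python
from typing import Hashable, List, Optional, Sequence


def agreement_indicators(
    list_a: Sequence[Hashable],
    list_b: Sequence[Hashable],
    max_distance: int,
) -> List[int]:
    """Return 1/0 per item of list_a: ranks within max_distance in both lists."""
    rank_b = {item: rank for rank, item in enumerate(list_b)}
    indicators = []
    for rank_a, item in enumerate(list_a):
        other = rank_b.get(item)
        agrees = other is not None and abs(rank_a - other) <= max_distance
        indicators.append(1 if agrees else 0)
    return indicators


def first_drop_below_half(indicators: Sequence[int], window: int) -> Optional[int]:
    """Return the 0-based start of the first window with mean agreement < 0.5.

    Read as a crude estimate of k: the number of top-ranked items before
    agreement breaks down. Returns None if no window falls below 0.5.
    """
    for start in range(len(indicators) - window + 1):
        mean = sum(indicators[start:start + window]) / window
        if mean < 0.5:
            return start
    return None


# Hypothetical rankings of eight genes from two analyses.
ranking_a = ["a", "b", "c", "d", "e", "f", "g", "h"]
ranking_b = ["b", "a", "c", "d", "g", "h", "e", "f"]
flags = agreement_indicators(ranking_a, ranking_b, max_distance=1)
k_estimate = first_drop_below_half(flags, window=2)
print(flags, k_estimate)
```

A moving average is sensitive to the window length, which is one reason a model-based smooth with uncertainty bounds, as the abstract describes, is preferable for real data.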

Binary Markov Random Fields and interpretable mass spectra discrimination.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2017-02-11 DOI: 10.1515/sagmb-2016-0019

Ao Kong, Robert Azencott

Abstract: For mass spectra acquired from cancer patients by MALDI or SELDI techniques, automated discrimination between cancer types or stages has often been implemented by machine learning algorithms. Nevertheless, these techniques typically lack interpretability in terms of biomarkers. In this paper, we propose a new mass spectra discrimination algorithm by parameterized Markov Random Fields to automatically generate interpretable classifiers with small groups of scored biomarkers. A dataset of 238 MALDI colorectal mass spectra and two datasets of 216 and 253 SELDI ovarian mass spectra respectively were used to test our approach. The results show that our approach reaches accuracies of 81% to 100% to discriminate between patients from different colorectal and ovarian cancer stages, and performs as well as or better than previous studies on similar datasets. Moreover, our approach enables efficient planar-displays to visualize mass spectra discrimination and has good asymptotic performance for large datasets. Thus, our classifiers should facilitate the choice and planning of further experiments for biological interpretation of cancer discriminating signatures. In our experiments, the number of mass spectra for each colorectal cancer stage is roughly half of that for each ovarian cancer stage, so that we reach lower discrimination accuracy for colorectal cancer than for ovarian cancer.

Citations: 0

Tree-based quantitative trait mapping in the presence of external covariates.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2016-12-01 DOI: 10.1515/sagmb-2015-0107

Katherine L Thompson, Catherine R Linnen, Laura Kubatko

Abstract: A central goal in biological and biomedical sciences is to identify the molecular basis of variation in morphological and behavioral traits. Over the last decade, improvements in sequencing technologies coupled with the active development of association mapping methods have made it possible to link single nucleotide polymorphisms (SNPs) and quantitative traits. However, a major limitation of existing methods is that they are often unable to consider complex, but biologically-realistic, scenarios. Previous work showed that association mapping method performance can be improved by using the evolutionary history within each SNP to estimate the covariance structure among randomly-sampled individuals. Here, we propose a method that can be used to analyze a variety of data types, such as data including external covariates, while considering the evolutionary history among SNPs, providing an advantage over existing methods. Existing methods either do so at a computational cost, or fail to model these relationships altogether. By considering the broad-scale relationships among SNPs, the proposed approach is both computationally-feasible and informed by the evolutionary history among SNPs. We show that incorporating an approximate covariance structure during analysis of complex data sets increases performance in quantitative trait mapping, and apply the proposed method to deer mice data.

Volume 15, Issue 6, Pages 473-490

Citations: 2

Adaptive input data transformation for improved network reconstruction with information theoretic algorithms.

IF 0.9, Zone 4 (Mathematics)

Statistical Applications in Genetics and Molecular Biology Pub Date : 2016-12-01 DOI: 10.1515/sagmb-2016-0013

Venkateshan Kannan, Jesper Tegner

Abstract: We propose a novel systematic procedure of non-linear data transformation for an adaptive algorithm in the context of network reverse-engineering using information theoretic methods. Our methodology is rooted in elucidating and correcting for the specific biases in the estimation techniques for mutual information (MI) given a finite sample of data. These are, in turn, tied to the lack of well-defined bounds for numerical estimation of MI for continuous probability distributions from finite data. The nature and properties of the inevitable bias are described, complemented by several examples illustrating their form and variation. We propose an adaptive partitioning scheme for MI estimation that effectively transforms the sample data using parameters determined from its local and global distribution, guaranteeing a more robust and reliable reconstruction algorithm. Together with a normalized measure (Shared Information Metric) we report considerably enhanced performance both for in silico and real-world biological networks. We also find that the recovery of true interactions is in particular better for an intermediate range of false positive rates, suggesting that our algorithm is less vulnerable to spurious signals of association.

Volume 15, Issue 6, Pages 507-520

Citations: 0