Bayesian multivariate Poisson model for RNA-seq classification
J. Knight, I. Ivanov, E. Dougherty
2013 IEEE International Workshop on Genomic Signal Processing and Statistics, 2013-11-01
DOI: 10.1109/GENSIPS.2013.6735946
High-dimensional data and small samples make genomic/proteomic classifier design and error estimation virtually impossible without the use of prior information [1]. Dalton and Dougherty utilize prior biological knowledge via a Bayesian approach that considers a prior distribution on an uncertainty class of feature-label distributions [2], [3]. While their general framework is very broad, they focus their attention on multinomial and Gaussian models, for which they derive closed-form solutions for the minimum mean squared error (MMSE) error estimate, the MSE of the error estimate, and an optimal Bayesian classifier (OBC) relative to the prior distribution. Sequencing datasets consist of the number of reads found to map to specific regions of a reference genome. As such, they are often modeled with a discrete distribution, such as the Poisson. For this reason, Gaussian and multinomial distributions are not ideal for sequence-based datasets. Thus, we introduce a multivariate Poisson (MP) model and the associated MP OBC for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. We demonstrate superior classification performance on the more complex synthetic datasets and performance comparable to the top classifiers on the simpler synthetic datasets.
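To make the idea of MCMC-based classification of count data concrete, the following is a minimal sketch and not the MP model or MP OBC of the paper. It assumes, purely for illustration, that features are independent Poisson counts with a Normal prior on each log-rate (a non-conjugate choice, so there is no closed form), samples each log-rate with random-walk Metropolis, and assigns a new sample to the class with the largest Monte Carlo estimate of the posterior predictive probability. All function names, the prior parameters, and the toy data are hypothetical.

```python
import math
import random


def log_poisson_pmf(k, lam):
    """Log of the Poisson pmf P(K = k) for rate lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def sample_log_rate(counts, n_samples=2000, burn_in=500, step=0.3,
                    prior_mu=0.0, prior_sigma=3.0, rng=None):
    """Random-walk Metropolis draws of theta = log(rate) for one feature.

    Assumed prior (illustrative only): theta ~ Normal(prior_mu, prior_sigma).
    """
    rng = rng or random.Random(0)

    def log_post(theta):
        lam = math.exp(theta)
        log_prior = -0.5 * ((theta - prior_mu) / prior_sigma) ** 2
        return log_prior + sum(log_poisson_pmf(k, lam) for k in counts)

    # Start near the sample mean; the 0.5 offset avoids log(0).
    theta = math.log(sum(counts) / len(counts) + 0.5)
    current = log_post(theta)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples


def fit_class(train, rng):
    """One chain of log-rate draws per feature (features treated as independent)."""
    n_features = len(train[0])
    return [sample_log_rate([row[j] for row in train], rng=rng)
            for j in range(n_features)]


def log_predictive(x, chains):
    """Monte Carlo estimate of log p(x | training data) for one class."""
    n = len(chains[0])
    terms = [sum(log_poisson_pmf(xj, math.exp(chain[s]))
                 for xj, chain in zip(x, chains))
             for s in range(n)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms) / n)


def classify(x, chains_by_class, log_priors):
    """Return the class label with the highest posterior score for sample x."""
    scores = {c: log_priors[c] + log_predictive(x, chains)
              for c, chains in chains_by_class.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy read counts: rows are samples, columns are two hypothetical genes.
    class0 = [[3, 20], [5, 18], [4, 25], [2, 22]]
    class1 = [[30, 2], [28, 4], [35, 1], [31, 3]]
    chains = {0: fit_class(class0, rng), 1: fit_class(class1, rng)}
    log_priors = {0: math.log(0.5), 1: math.log(0.5)}
    print(classify([4, 21], chains, log_priors))
```

Treating features as independent discards the between-gene dependence that a multivariate model is meant to capture, so this sketch only illustrates the sampling-then-averaging pattern, not the modeling choices of the paper.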