Are Profile Hidden Markov Models Identifiable?

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI:10.1145/3233547.3233563

Srilakshmi Pattabiraman, T. Warnow



Abstract

Profile Hidden Markov Models (HMMs) are graphical models that can be used to produce finite-length sequences from a distribution. Although they were introduced to bioinformatics only 25 years ago (by Haussler et al., Hawaii International Conference on System Sciences 1993), they are arguably the most commonly used statistical model in the field, with applications that include protein structure and function prediction, classification of novel proteins into existing protein families and superfamilies, metagenomics, and multiple sequence alignment. The standard use of profile HMMs in bioinformatics has two steps: first, a profile HMM is built for a collection of molecular sequences (which may not be in a multiple sequence alignment), and then the profile HMM is used in some subsequent analysis of new molecular sequences. The construction of the profile is thus itself a statistical estimation problem, since any given set of sequences might fit more than one model well. Hence a basic question about profile HMMs is whether they are statistically identifiable, which means that no two distinct profile HMMs can produce the same distribution on finite-length sequences. Statistical identifiability is a fundamental aspect of any statistical model, and yet it is not known whether profile HMMs are statistically identifiable. In this paper, we report on preliminary results towards characterizing the statistical identifiability of profile HMMs in one of the standard forms used in bioinformatics.
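
The identifiability question posed in the abstract can be written compactly. The following is only a sketch in assumed notation, not the paper's own formalism: $\Sigma$ is the sequence alphabet, $\Theta$ is the set of profile HMMs of one fixed standard form (topology plus transition and emission probabilities), and $P_\theta$ is the distribution on finite-length sequences induced by a model $\theta$, obtained by summing over all state paths $\pi$ from the begin state to the end state that emit a given sequence $x$.

```latex
% Assumed notation (not taken from the paper):
%   \Sigma^*   : set of finite-length sequences over the alphabet \Sigma
%   \Theta     : set of profile HMMs of a fixed standard form
%   P_\theta   : distribution on \Sigma^* induced by the model \theta
\[
  P_\theta(x) \;=\; \sum_{\pi \,:\, \pi \text{ emits } x} P_\theta(x, \pi),
  \qquad x \in \Sigma^* .
\]
% The model class is statistically identifiable if the map
% \theta \mapsto P_\theta is injective:
\[
  \forall\, \theta, \theta' \in \Theta : \quad
  P_\theta = P_{\theta'} \;\Longrightarrow\; \theta = \theta' .
\]
```

Under this reading, a single pair of distinct models $\theta \neq \theta'$ with $P_\theta(x) = P_{\theta'}(x)$ for every $x \in \Sigma^*$ would be enough to show non-identifiability.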
