Use of directed quasi-metric distances for quantifying the information of gene families
Biosystems, volume 243, Article 105256. Published 2024-06-12. DOI: 10.1016/j.biosystems.2024.105256. Full text: https://www.sciencedirect.com/science/article/pii/S0303264724001412
A major hindrance to analyzing information in genetic or protein sequence data has been the lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of the data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information-theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants have a selective advantage and therefore appear more frequently than prior expectations suggest. These include distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, where functional entities are retained whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been applied successfully to bioinformatics data sets, for proteins and sequence alignments of protein families.
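The rejection sampling reading of functional information can be illustrated with a minimal sketch. The abstract does not state a formula, so the code assumes the commonly used definition, in which functional information is the negative base-2 logarithm of the prior probability of the functional categories; the paper's own definition may differ. The function names, the toy prior and the choice of functional set are all hypothetical.

```python
import math
import random


def functional_information(prior, functional):
    """Assumed definition: -log2 of the prior mass on the functional categories."""
    p_func = sum(p for cat, p in prior.items() if cat in functional)
    if p_func <= 0:
        return math.inf  # no functional mass: infinitely many bits are needed
    return -math.log2(p_func)


def rejection_sample(prior, functional, rng):
    """Draw from the prior until a functional category appears.

    Non-functional draws are censored (discarded). Returns the retained
    category and the number of draws it took.
    """
    cats = list(prior)
    weights = [prior[c] for c in cats]
    if not any(w > 0 for c, w in zip(cats, weights) if c in functional):
        raise ValueError("functional set has zero prior probability")
    draws = 0
    while True:
        draws += 1
        cat = rng.choices(cats, weights=weights, k=1)[0]
        if cat in functional:
            return cat, draws


# Hypothetical toy prior over four variants; "c" and "d" are taken as functional.
prior = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
functional = {"c", "d"}

fi_bits = functional_information(prior, functional)
category, n_draws = rejection_sample(prior, functional, random.Random(0))
```

Under this assumed definition, the number of draws until the first retained sample is geometrically distributed with mean 1 / P(functional) = 2 ** fi_bits, which is one way to connect the rejection sampling picture to a waiting time.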
About the journal:
BioSystems encourages experimental, computational, and theoretical articles that link biology, evolutionary thinking, and the information processing sciences. The link areas form a circle that encompasses the fundamental nature of biological information processing, computational modeling of complex biological systems, evolutionary models of computation, the application of biological principles to the design of novel computing systems, and the use of biomolecular materials to synthesize artificial systems that capture essential principles of natural biological information processing.