DOI: 10.1016/j.biosystems.2024.105256. Published 2024-06-12. Article page: https://www.sciencedirect.com/science/article/pii/S0303264724001412
Use of directed quasi-metric distances for quantifying the information of gene families
A major hindrance to analyzing information in genetic or protein sequence data has been the lack of a mathematical framework for doing so. In this paper, we present a multinomial probability space X as a general foundation for multicategory discrete data, where categories refer to variants/alleles of biosequences. The external information that is infused in order to generate a sample of such data is quantified as a distance on X between the prior distribution of the data and the empirical distribution of the sample. A number of distances on X are treated. All of them have an information-theoretic interpretation, reflecting the information that the sampling mechanism provides about which variants have a selective advantage and therefore appear more frequently than expected under the prior. These include distances on X based on mutual information, conditional mutual information, active information, and functional information. The functional information distance is singled out as particularly useful. It is simple and has intuitive interpretations in terms of 1) a rejection sampling mechanism, in which functional entities are retained whereas non-functional categories are censored, and 2) evolutionary waiting times. The functional information is also a quasi-metric on X, with information being measured in an asymmetric, mountainous landscape. This quasi-metric property is retained for a robustified version of the functional information distance that allows for mutations in the sampling mechanism. The functional information quasi-metric has been applied successfully to bioinformatics data sets, for proteins and for sequence alignments of protein families.
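The link between functional information, rejection sampling, and waiting times can be illustrated with a minimal Python sketch. It assumes the commonly used definition of functional information as the negative base-2 logarithm of the prior probability of the functional categories; the abstract does not state the paper's exact distance, so this may differ from it. The prior, the category labels, and both function names are hypothetical and chosen only for illustration.

```python
import math
import random


def functional_information(prior, functional):
    """Return -log2 of the prior probability mass on the functional categories.

    Assumed definition for illustration; not taken from the paper itself.
    """
    mass = sum(p for category, p in prior.items() if category in functional)
    if mass <= 0.0:
        raise ValueError("functional set has zero prior probability")
    return -math.log2(mass)


def rejection_sample(prior, functional, rng):
    """Draw from the prior until a functional category appears.

    Non-functional draws are censored (discarded). Returns the retained
    category and the number of draws that were needed.
    """
    categories = list(prior)
    weights = [prior[c] for c in categories]
    draws = 0
    while True:
        draws += 1
        candidate = rng.choices(categories, weights=weights, k=1)[0]
        if candidate in functional:
            return candidate, draws


# Hypothetical prior over four variants; C and D are treated as functional.
prior = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
functional = {"C", "D"}

fi = functional_information(prior, functional)

# Estimate the mean waiting time until a functional variant is retained.
rng = random.Random(0)
n_runs = 20000
mean_wait = sum(rejection_sample(prior, functional, rng)[1] for _ in range(n_runs)) / n_runs

print(f"functional information: {fi} bits")
print(f"mean waiting time: {mean_wait:.2f} draws")
```

In this toy setting the functional categories carry prior mass 0.25, so the functional information is 2 bits, and the number of draws until acceptance is geometric with mean 1/0.25 = 4 = 2^2. Under the assumed definition, the functional information is therefore the base-2 logarithm of the expected waiting time.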