Protein is incompressible

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096). Pub Date: 1999-03-29. DOI: 10.1109/DCC.1999.755675

C. Nevill-Manning, I. Witten


Citations: 96

Abstract


Protein is incompressible

Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
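
The two ideas named in the abstract, an order-zero model and a blend over every context weighted by its similarity to the current one, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the add-one prior, and the position-match `similarity` measure are all assumptions made for illustration; the abstract says only that the approach takes account of biochemical principles, so a real similarity measure would presumably be biochemically motivated rather than a plain match count.

```python
import math
from collections import Counter


def order0_bits_per_symbol(seq):
    """Empirical order-zero entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def similarity(a, b):
    """Fraction of positions at which two equal-length contexts agree.

    Placeholder assumption: stands in for whatever biochemically
    motivated similarity a real compressor would use.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def blended_distribution(history, k, alphabet):
    """Predict the next symbol after history.

    Every earlier length-k context contributes a vote for the symbol
    that followed it, weighted by its similarity to the current context.
    """
    if k < 1 or len(history) < k:
        raise ValueError("need k >= 1 and at least k symbols of history")
    current = history[-k:]
    # Hypothetical add-one prior so that no symbol gets zero probability.
    weights = {s: 1.0 for s in alphabet}
    for i in range(k, len(history)):
        context = history[i - k:i]
        weights[history[i]] += similarity(context, current)
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


if __name__ == "__main__":
    print(order0_bits_per_symbol("ACGT"))
    print(blended_distribution("ABABAB", 1, "AB"))
```

For scale, an order-zero model over 20 equally likely amino acids costs log2(20) ≈ 4.32 bits per symbol. If, as the abstract reports, protein has little Markov dependency, then context-based predictions such as `blended_distribution` gain little over that kind of baseline.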

