A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae.

G Sampath
Proceedings. IEEE Computer Society Bioinformatics Conference, vol. 2, pp. 287-93. Published 2003-01-01.

Abstract

A simple statistical block code, combined with the Lempel-Ziv-based compression utilities gzip and compress, has been found to significantly increase the level of compression achievable for the proteins encoded in Haemophilus influenzae, the first free-living organism to have its genome fully sequenced. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps (log2 of 20, for the 20-letter amino acid alphabet) and an improvement of 0.452 bps over the best previously known value of 4.118 bps, obtained with Matsumoto, Sadakane, and Imai's lza-CTW algorithm. Calculations based on a compact inverse genetic code show that the coding regions of the genome have a maximum entropy of 1.757 bps, with a possibly lower actual entropy. These results hint at hitherto unexplored redundancies that do not show up in Markov models, and they indicate more internal structure than suspected in both the proteins and the genome.
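The figures quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not the paper's block code: it computes the maximum entropy of a 20-letter alphabet and gives a rough compression-based estimate of bits per symbol using Python's `zlib` module as a stand-in for gzip. The function names and the toy sequence are hypothetical and chosen only for illustration.

```python
import math
import zlib

# The 20 standard amino acid one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_entropy_bits(alphabet_size: int) -> float:
    """Entropy in bits per symbol of a uniform source over the alphabet."""
    return math.log2(alphabet_size)


def compressed_bits_per_symbol(sequence: str) -> float:
    """Rough upper bound on entropy: compressed size in bits per input symbol.

    Uses zlib (DEFLATE) as a stand-in for gzip; this is NOT the block
    coding method described in the abstract.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    raw = sequence.encode("ascii")
    packed = zlib.compress(raw, 9)
    return 8 * len(packed) / len(raw)


if __name__ == "__main__":
    h_max = max_entropy_bits(len(AMINO_ACIDS))
    print(f"maximum entropy: {h_max:.3f} bps")
    print(f"gap to the reported 3.665 bps: {h_max - 3.665:.3f} bps")

    # Hypothetical, highly repetitive toy sequence (not real genome data).
    toy = "MKTAYIAKQR" * 200
    print(f"toy sequence: {compressed_bits_per_symbol(toy):.3f} bps")
```

A repetitive toy input compresses far below the maximum, whereas real protein sequences are known to be hard to compress, which is why a result of 3.665 bps is notable.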
