Exploring species taxonomic kingdom using information entropy and nucleotide compositional features of coding sequences based on machine learning methods

IF 4.3 · Zone 3 (Biology) · JCR Q1, Biochemical Research Methods
Sebu Aboma Temesgen, Basharat Ahmad, Bakanina Kissanga Grace-Mercure, Minghao Liu, Li Liu, Hao Lin, Kejun Deng
Methods, Volume 240, Pages 165-179. Published 2025-04-23. Journal Article.
DOI: 10.1016/j.ymeth.2025.03.023
https://www.sciencedirect.com/science/article/pii/S1046202325001069
Citations: 0

Abstract

The flow of genetic information from DNA to protein is governed by the central dogma of molecular biology. Genetic drift and mutations usually lead to changes in DNA composition, thereby affecting the coding sequences (CDSs) that encode functional proteins. Analyzing the nucleotide distribution in the coding regions of species is crucial for understanding their evolution. In this study, we applied Markov processes to analyze codon formation in 37,031,061 CDSs across 3,735 species genomes, spanning viruses, archaea, bacteria, and eukaryotes, to explore compositional changes. Our results revealed species preferences for different nucleotides. Information entropies and Markov information densities show that eukaryotes exhibit the highest redundancy, followed by viruses, suggesting more gene duplication in eukaryotes and high mutation rates in viruses. Evolutionary trends showed an increase in information entropy and a decrease in Markov entropy, with negative correlations between first- and second-order Markov information densities. Furthermore, uniform manifold approximation and projection (UMAP) was used to reduce information redundancy and reveal unique evolutionary patterns in species classification. The machine learning methods demonstrated excellent performance in species classification accuracy, providing profound insights into CDS evolution and protein synthesis.
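As an illustration only: the abstract does not give the authors' formulas, so the sketch below uses the textbook definitions of Shannon entropy and of k-th order Markov (conditional) entropy for a nucleotide string. The function names, the redundancy measure (1 - H / 2 bits for a four-letter alphabet), and the toy sequences are assumptions made for this example, not the paper's method.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Zero-order Shannon entropy of a sequence, in bits per symbol."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def markov_entropy(seq, order=1):
    """Conditional entropy H(next symbol | previous `order` symbols), in bits.

    Textbook k-th order estimate from observed counts; assumed here,
    not taken from the paper.
    """
    if len(seq) <= order:
        raise ValueError("sequence must be longer than the Markov order")
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        joint_counts[(context, seq[i])] += 1
    total = len(seq) - order
    h = 0.0
    for (context, _symbol), c in joint_counts.items():
        p_joint = c / total                  # P(context, symbol)
        p_cond = c / context_counts[context]  # P(symbol | context)
        h -= p_joint * math.log2(p_cond)
    return h


def redundancy(entropy_bits, alphabet_size=4):
    """Redundancy relative to the maximum entropy log2(alphabet_size)."""
    return 1.0 - entropy_bits / math.log2(alphabet_size)


if __name__ == "__main__":
    toy_cds = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up example sequence
    h0 = shannon_entropy(toy_cds)
    h1 = markov_entropy(toy_cds, order=1)
    h2 = markov_entropy(toy_cds, order=2)
    print(f"H0={h0:.3f} H1={h1:.3f} H2={h2:.3f} R0={redundancy(h0):.3f}")
```

Per-genome values of this kind (one entropy per Markov order, averaged over CDSs) could then serve as input features for dimensionality reduction and a classifier, which is the general workflow the abstract describes.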
Source journal: Methods (Biology, Biochemical Research Methods)
CiteScore: 9.80
Self-citation rate: 2.10%
Articles published: 222
Review time: 11.3 weeks
Journal description: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches, with emphasis on clear, detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.