Exploring species taxonomic kingdom using information entropy and nucleotide compositional features of coding sequences based on machine learning methods

IF 4.3 | Zone 3 (Biology) | Q1 | BIOCHEMICAL RESEARCH METHODS

Methods Pub Date : 2025-04-23 DOI:10.1016/j.ymeth.2025.03.023

Sebu Aboma Temesgen , Basharat Ahmad , Bakanina Kissanga Grace-Mercure , Minghao Liu , Li Liu , Hao Lin , Kejun Deng

Methods, Volume 240, Pages 165-179. Publication type: Journal Article. Open access: no. URL: https://www.sciencedirect.com/science/article/pii/S1046202325001069

Citations: 0

Abstract

The flow of genetic information from DNA to protein is governed by the central dogma of molecular biology. Genetic drift and mutations usually lead to changes in DNA composition, thereby affecting the coding sequences (CDS) that encode functional proteins. Analyzing the nucleotide distribution in the coding regions of species is crucial for understanding their evolution. In this study, we applied Markov processes to analyze codon formation in 37,031,061 CDSs across 3,735 species genomes, spanning viruses, archaea, bacteria, and eukaryotes, to explore compositional changes. Our results revealed species preferences for different nucleotides. Information entropies and Markov information densities show that eukaryotes exhibit higher redundancy, followed by viruses, suggesting more gene duplication in eukaryotes and high mutation rates in viruses. Evolutionary trends showed an increase in information entropy and a decrease in Markov entropy, with negative correlations between first- and second-order Markov information densities. Furthermore, uniform manifold approximation and projection (UMAP) was used to reduce information redundancy for revealing unique evolutionary patterns in species classification. The machine learning methods demonstrated excellent performance in species classification accuracy, providing profound insights into CDS evolution and protein synthesis.
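The abstract does not give the exact formulas the authors used, so the following is only a minimal sketch of the kind of quantities it names, assuming the standard Shannon definitions: the zeroth-order entropy of the base composition, the first-order Markov (conditional) entropy of each base given its predecessor, and redundancy as one minus the ratio of entropy to its 2-bit maximum for a four-letter alphabet. The function names `shannon_entropy`, `markov_entropy`, and `redundancy` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: ambiguous bases such as N are simply dropped


def shannon_entropy(seq: str) -> float:
    """Zeroth-order entropy (bits per base) of the nucleotide composition."""
    counts = Counter(b for b in seq.upper() if b in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def markov_entropy(seq: str) -> float:
    """First-order conditional entropy H(X_t | X_{t-1}) in bits per base.

    Computed as -sum over pairs (a, b) of p(a, b) * log2 p(b | a), with
    probabilities estimated from overlapping adjacent pairs. Dropping
    ambiguous bases joins their neighbours, which is a simplification.
    """
    bases = [b for b in seq.upper() if b in ALPHABET]
    pairs = Counter(zip(bases, bases[1:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    # Count how often each base appears as the first element of a pair.
    prev = Counter()
    for (a, _), n in pairs.items():
        prev[a] += n
    return -sum(
        (n / total) * math.log2(n / prev[a]) for (a, _), n in pairs.items()
    )


def redundancy(entropy_bits: float, max_bits: float = 2.0) -> float:
    """Shannon redundancy: 1 - H / H_max, with H_max = log2(4) = 2 bits."""
    return 1.0 - entropy_bits / max_bits


if __name__ == "__main__":
    cds = "ATGGCCAAGTTTGGCTAA"  # toy sequence, not data from the study
    h0 = shannon_entropy(cds)
    h1 = markov_entropy(cds)
    print(f"H0={h0:.3f} bits, H1={h1:.3f} bits, redundancy={redundancy(h0):.3f}")
```

Under these definitions, a sequence with uniform base usage has zero redundancy, and a lower conditional entropy than zeroth-order entropy indicates that neighbouring bases are statistically dependent. Whether this matches the paper's "Markov information density" is an assumption that should be checked against the full text.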

Source journal

Methods (Biology: Biochemical Research Methods)

CiteScore: 9.80
Self-citation rate: 2.10%
Articles published: 222
Review time: 11.3 weeks

About the journal: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.