Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization.

IF 5.2 | Zone 1 (Biology) | JCR Q1 (BIOLOGY)
Mohammadsaleh Refahi, Bahrad A Sokhansanj, Joshua C Mell, James R Brown, Hyunwoo Yoo, Gavin Hearne, Gail L Rosen
Communications Biology 8(1):517. Journal Article. Published 2025-03-29.
DOI: 10.1038/s42003-025-07902-6 (https://doi.org/10.1038/s42003-025-07902-6)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11953366/pdf/
Citations: 0

Abstract

Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization.

Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
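
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch of a normalized k-mer frequency vector and a triplet-style contrastive loss over such vectors. It is an illustration under stated assumptions, not Scorpio's implementation: the abstract does not specify the loss form, the distance metric, or the value of k, so the function names (`kmer_frequencies`, `triplet_loss`), the Euclidean distance, the margin, and `k = 3` are all hypothetical choices.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq: str, k: int = 3) -> list:
    """Return a normalized k-mer frequency vector over the alphabet ACGT.

    Hypothetical helper for illustration; k-mers containing symbols
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Hinge-style triplet loss (an assumed objective, not taken from the paper).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Usage: anchor and positive are similar sequences, negative is unrelated.
anchor = kmer_frequencies("ATGGCGTACGTTAGC")
positive = kmer_frequencies("ATGGCGTACGTAAGC")
negative = kmer_frequencies("TTTTTTTTTTTTTTT")
loss = triplet_loss(anchor, positive, negative, margin=0.1)
```

In a trained model the vectors would come from a learned encoder whose parameters are updated to reduce this kind of loss; here the raw frequency vectors stand in for embeddings only to show the shape of the computation.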

Source journal: Communications Biology
Subject area: Medicine-Medicine (miscellaneous)
CiteScore: 8.60
Self-citation rate: 1.70%
Articles published: 1233
Review time: 13 weeks
About the journal: Communications Biology is an open access journal from Nature Research publishing high-quality research, reviews and commentary in all areas of the biological sciences. Research papers published by the journal represent significant advances bringing new biological insight to a specialized area of research.