Improving spliced alignment by modeling splice sites with deep learning.

ArXiv Pub Date : 2025-09-20
Siying Yang, Neng Huang, Heng Li
Journal: ArXiv
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12447723/pdf/

Abstract

Motivation: Spliced alignment refers to the alignment of messenger RNA (mRNA) or protein sequences to eukaryotic genomes. It plays a critical role in gene annotation and the study of gene functions. Accurate spliced alignment demands sophisticated modeling of splice sites, but current aligners use simple models, which may reduce their accuracy when the query and the genome are dissimilar.

Results: We implemented minisplice to learn splice signals with a one-dimensional convolutional neural network (1D-CNN) and trained a model with 7,026 parameters for vertebrate and insect genomes. It captures conserved splice signals across phyla and reveals GC-rich introns specific to mammals and birds. We used this model to estimate the empirical splicing probability for every GT and AG in genomes, and modified minimap2 and miniprot to use these pre-computed splicing probabilities during alignment. Evaluation on human long-read RNA-seq data and cross-species protein datasets showed that our method greatly improves junction accuracy, especially for noisy long RNA-seq reads and proteins of distant homology.

Availability and implementation: https://github.com/lh3/minisplice.
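
The Results paragraph describes scoring every GT and AG in a genome ahead of alignment. The sketch below illustrates only that general idea and is not minisplice's code: it enumerates forward-strand GT (candidate donor) and AG (candidate acceptor) dinucleotides, cuts a fixed window around each, one-hot encodes it, and passes it to a scoring function. The names `candidate_sites`, `window`, `score_sites`, the `flank` parameter, and especially `placeholder_scorer` are assumptions made for illustration; the placeholder simply returns the GC fraction of the window and stands in for a trained 1D-CNN.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# One-hot codes for A, C, G, T; any other base (e.g. N) maps to all zeros.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
Encoded = List[Tuple[int, int, int, int]]


def candidate_sites(seq: str) -> Iterator[Tuple[int, str]]:
    """Yield (position, kind) for every GT and AG on the forward strand only."""
    s = seq.upper()
    for i in range(len(s) - 1):
        dinuc = s[i:i + 2]
        if dinuc == "GT":
            yield i, "donor"
        elif dinuc == "AG":
            yield i, "acceptor"


def window(seq: str, pos: int, flank: int) -> str:
    """Return the dinucleotide at pos plus `flank` bases on each side, N-padded."""
    s = seq.upper()
    left, right = pos - flank, pos + 2 + flank
    pad_left = "N" * max(0, -left)
    pad_right = "N" * max(0, right - len(s))
    return pad_left + s[max(0, left):min(len(s), right)] + pad_right


def one_hot(w: str) -> Encoded:
    """Encode a window as a list of 4-tuples, one per base."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in w]


def placeholder_scorer(encoded: Sequence[Tuple[int, int, int, int]]) -> float:
    """Hypothetical stand-in for a trained model: GC fraction of the window."""
    gc = sum(v[1] + v[2] for v in encoded)
    return gc / len(encoded)


def score_sites(
    seq: str,
    scorer: Callable[[Encoded], float] = placeholder_scorer,
    flank: int = 3,
) -> List[Tuple[int, str, float]]:
    """Score every candidate site once, so a later step can look the values up."""
    return [
        (pos, kind, scorer(one_hot(window(seq, pos, flank))))
        for pos, kind in candidate_sites(seq)
    ]


if __name__ == "__main__":
    for pos, kind, score in score_sites("AAGTCCAGTT"):
        print(pos, kind, round(score, 3))
```

Computing the scores once per genome and storing them keeps the model out of the aligner's inner loop; how minimap2 and miniprot actually consume the pre-computed values is documented in the repository linked above and is not reproduced here.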
