Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.

IF 1.5 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Journal of Integrative Bioinformatics. Pub Date: 2022-07-12; eCollection Date: 2022-09-01; DOI: 10.1515/jib-2021-0036
Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot

Abstract

Transposable elements (TEs) are mobile sequences that can move and insert themselves into chromosomes. They are activated by internal or external stimuli and give the organism the ability to adapt to its environment. Annotating transposable elements in genomic data is currently considered crucial for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created and then curated to eliminate TE fragments and false positives; the curated library is then used to annotate the genome with the homology method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics tools. Here, we propose a machine learning-based approach that performs this process automatically on plant genomes, obtaining an F1-score of up to 91.18%. The approach was tested on four plant species, obtaining an F1-score of up to 93.6% (Oryza granulata) in only 22.61 s, whereas bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
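The two quantities the abstract reports, F1-score and runtime, can be made concrete with a short sketch. It uses the standard definition of F1 (the harmonic mean of precision and recall) and derives the speed-up implied by the reported 22.61 s versus approximately 6 h. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


# Hypothetical counts (NOT from the paper): 90 intact elements kept,
# 10 false positives kept, 10 intact elements wrongly discarded.
example_f1 = f1_score(tp=90, fp=10, fn=10)

# Runtimes as reported in the abstract: ~6 h versus 22.61 s.
ml_speedup = speedup(baseline_seconds=6 * 3600, new_seconds=22.61)

print(f"example F1: {example_f1:.3f}")
print(f"speed-up:   {ml_speedup:.0f}x")
```

Under these figures the ML-based curation is roughly three orders of magnitude faster than the bioinformatics pipeline, which is the basis for the claim that it could scale to massive sequencing projects.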

Source journal: Journal of Integrative Bioinformatics (Medicine: Medicine (all))
CiteScore: 3.10
Self-citation rate: 5.30%
Articles published: 27
Review time: 12 weeks