通过机器学习从植物基因组中自动管理LTR反转录转座子文库。

IF 1.8 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Integrative Bioinformatics Pub Date : 2022-07-12 eCollection Date: 2022-09-01 DOI:10.1515/jib-2021-0036

Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot

{"title":"通过机器学习从植物基因组中自动管理LTR反转录转座子文库。","authors":"Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot","doi":"10.1515/jib-2021-0036","DOIUrl":null,"url":null,"abstract":"Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.","PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521825/pdf/","citationCount":"0","resultStr":"{\"title\":\"Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.\",\"authors\":\"Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot\",\"doi\":\"10.1515/jib-2021-0036\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.\",\"PeriodicalId\":53625,\"journal\":{\"name\":\"Journal of Integrative Bioinformatics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2022-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521825/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Integrative Bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1515/jib-2021-0036\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2022/9/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q3\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Integrative Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1515/jib-2021-0036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2022/9/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

转座因子是一种可移动的序列，它可以移动并插入到染色体中，在内部或外部刺激下激活，使生物体具有适应环境的能力。在基因组数据中标注转座因子目前被认为是理解生物体关键方面的关键任务，如表型变异性、物种进化和基因组大小等。由于它们复制的方式，LTR逆转录转座子是植物中最常见的转座子，在某些情况下占所有DNA信息的80%。为了标注这些元素，通常创建一个参考文库，执行一个管理过程，消除TE片段和假阳性，然后使用同源性方法在基因组中进行标注。然而，管理过程可能需要数周时间，需要大量的手工工作和多个耗时的生物信息学软件的执行。在这里，我们提出了一种基于机器学习的方法来对植物基因组自动执行这一过程，获得高达91.18%的f1得分。该方法在4种植物中进行了测试，仅用22.61 s就获得了93.6%的f1分数(Oryza granulata)，而生物信息学方法大约需要6小时。这表明基于ml的方法是有效的，可以用于大规模测序项目。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.

查看原文本刊更多论文

Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.

Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Integrative Bioinformatics Medicine-Medicine (all)

CiteScore

3.10

自引率

5.30%

发文量

审稿时长

12 weeks