Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot
{"title":"通过机器学习从植物基因组中自动管理LTR反转录转座子文库。","authors":"Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot","doi":"10.1515/jib-2021-0036","DOIUrl":null,"url":null,"abstract":"<p><p>Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (<i>Oryza granulata</i>) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.</p>","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":" ","pages":""},"PeriodicalIF":16.4000,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521825/pdf/","citationCount":"0","resultStr":"{\"title\":\"Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.\",\"authors\":\"Simon Orozco-Arias, Mariana S Candamil-Cortes, Paula A Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Gustavo Isaza, Romain Guyot\",\"doi\":\"10.1515/jib-2021-0036\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (<i>Oryza granulata</i>) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.</p>\",\"PeriodicalId\":1,\"journal\":{\"name\":\"Accounts of Chemical Research\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":16.4000,\"publicationDate\":\"2022-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521825/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Accounts of Chemical Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1515/jib-2021-0036\",\"RegionNum\":1,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2022/9/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1515/jib-2021-0036","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2022/9/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.
Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
期刊介绍:
Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance.
Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.