PlantLncBoost: key features for plant lncRNA identification and significant improvement in accuracy and generalization
Xue‐Chan Tian, Shuai Nie, Douglas Domingues, Alexandre Rossi Paschoal, Li‐Bo Jiang, Jian‐Feng Mao
New Phytologist, published 2025-05-28. DOI: 10.1111/nph.70211 (https://doi.org/10.1111/nph.70211)
Abstract
Summary
Long noncoding RNAs (lncRNAs) are critical regulators of numerous biological processes in plants. Nevertheless, their identification is challenging due to the low sequence conservation across various species. Existing computational methods for lncRNA identification often face difficulties in generalizing across diverse plant species, highlighting the need for more robust and versatile identification models.
Here, we present PlantLncBoost, a novel computational tool designed to improve the generalization in plant lncRNA identification. By integrating advanced gradient boosting algorithms with comprehensive feature selection, our approach achieves both high accuracy and generalizability. We conducted an extensive analysis of 1662 features and identified three key features – ORF coverage, complex Fourier average, and atomic Fourier amplitude – that effectively distinguish lncRNAs from mRNAs.
We assessed the performance of PlantLncBoost using comprehensive datasets from 20 plant species. The model exhibited exceptional performance, with an accuracy of 96.63%, a sensitivity of 98.42%, and a specificity of 94.93%, significantly outperforming existing tools. Further analysis revealed that the features we selected effectively capture the differences between lncRNAs and mRNAs across a variety of plant species.
PlantLncBoost represents a significant advancement in plant lncRNA identification. It is freely accessible on GitHub (https://github.com/xuechantian/PlantLncBoost) and has been integrated into a comprehensive analysis pipeline, Plant‐LncRNA‐pipeline v.2 (https://github.com/xuechantian/Plant-LncRNA-pipeline-v2).
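As an illustration of two quantities named in the abstract, the sketch below computes an ORF coverage value for a transcript and the accuracy, sensitivity, and specificity of a binary classifier. It is not the PlantLncBoost implementation. It assumes that ORF coverage means the length of the longest forward-strand ORF (ATG through an in-frame stop codon) divided by the transcript length, and that lncRNA is the positive class; the abstract states neither definition. All function names are hypothetical, and the two Fourier features are left out because the abstract does not define them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length (nt) of the longest ATG..stop ORF in the three forward frames.

    Assumption: the stop codon is counted, and ORFs with no in-frame
    stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_coverage(seq: str) -> float:
    """Assumed definition: longest ORF length divided by transcript length."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    Assumption: lncRNA is the positive class, mRNA the negative class.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # positives correctly recovered
        "specificity": tn / (tn + fp),  # negatives correctly rejected
    }


if __name__ == "__main__":
    print(orf_coverage("CCATGAAATAGCC"))
    print(classification_metrics(tp=90, fn=10, tn=80, fp=20))
```

Under these assumptions, a protein-coding transcript would tend to have a higher ORF coverage than a lncRNA of similar length, which is the intuition behind using such a feature to separate the two classes.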
About the journal:
New Phytologist is an international electronic journal published 24 times a year. It is owned by the New Phytologist Foundation, a non-profit-making charitable organization dedicated to promoting plant science. The journal publishes excellent, novel, rigorous, and timely research and scholarship in plant science and its applications. Articles fall into five sections: Physiology & Development, Environment, Interaction, Evolution, and Transformative Plant Biotechnology. These sections span topics from intracellular processes to global environmental change and encourage cross-disciplinary approaches. The journal recognizes the use of techniques from molecular and cell biology, functional genomics, modeling, and system-based approaches in plant science. New Phytologist is abstracted and indexed in Academic Search, AgBiotech News & Information, Agroforestry Abstracts, Biochemistry & Biophysics Citation Index, Botanical Pesticides, CAB Abstracts®, Environment Index, Global Health, Plant Breeding Abstracts, and others.