Bioinformatics advances: latest articles

Understanding ecological systems using knowledge graphs: an application to highly pathogenic avian influenza.
IF 2.4
Bioinformatics advances Pub Date : 2025-02-05 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf016
Hailey Robertson, Barbara A Han, Adrian A Castellanos, David Rosado, Guppy Stott, Ryan Zimmerman, John M Drake, Ellie Graeden
Abstract
Motivation: Ecological systems are complex. Representing heterogeneous knowledge about ecological systems is a pervasive challenge because data are generated from many subdisciplines, exist in disparate sources, and only capture a subset of interactions underpinning system dynamics. Knowledge graphs (KGs) have been successfully applied to organize heterogeneous data and to predict new linkages in complex systems. Though not previously applied broadly in ecology, KGs have much to offer in an era when system dynamics are responding to rapid changes across multiple scales.
Results: We developed a KG to demonstrate the method's utility for ecological problems focused on highly pathogenic avian influenza (HPAI), a highly transmissible virus with a broad host range, wide geographic distribution, and rapid evolution with pandemic potential. We describe the development of a graph to include data related to HPAI including pathogen-host associations, species distributions, and population demographics, using a semantic ontology that defines relationships within and between datasets. We use the graph to perform a set of proof-of-concept analyses validating the method and identifying patterns of HPAI ecology. We underscore the generalizable value of KGs to ecology, including the ability to reveal previously known relationships and testable hypotheses in support of a deeper mechanistic understanding of ecological systems.
Availability and implementation: The data and code are available under the MIT License on GitHub at https://github.com/cghss-data-lab/uga-pipp.
Bioinformatics advances, vol. 5, no. 1, vbaf016. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879169/pdf/
Citations: 0
Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning.
IF 2.4
Bioinformatics advances Pub Date : 2025-02-05 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf019
Shuangjia Lu, Erdal Cosgun
Abstract
Motivation: Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs by retrieval-augmented generation (RAG) and fine-tuning techniques.
Results: Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from five major annotation datasets and tools, into GPT-4o. This integration empowers users to query specific variants and receive accurate variant annotations and interpretations supported by advanced reasoning and language understanding capabilities of LLMs. Additionally, fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although the accuracy across more fields remains suboptimal. Our model significantly improved the accessibility and efficiency of the variant interpretation process by leveraging LLM capabilities. Our project also revealed that RAG outperforms fine-tuning in factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study for adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains.
Availability and implementation: We used publicly available datasets as detailed in the paper, which can be provided upon request.
Bioinformatics advances, vol. 5, no. 1, vbaf019. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11842050/pdf/
Citations: 0
Accurate, comprehensive database of group I introns and their homing endonucleases.
IF 2.4
Bioinformatics advances Pub Date : 2025-02-05 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf020
Lara Sellés Vidal, Tomoya Noma, Yohei Yokobayashi
Abstract
Motivation: Group I introns are one of the most widely studied ribozymes. Since their initial discovery, a large number of them have been identified experimentally or computationally. However, no comprehensive and unified database that provides group I intron sequences with precise boundaries and structural information is available.
Results: We created a new database of group I intron sequences with reliable exon-intron boundaries. The database offers additional data for each sequence, such as the containing GenBank entry, its position within the associated entry, the subtype of each intron, and putative homing endonucleases. Secondary structure predictions and base-pairing probability matrixes are also provided for each sequence. The resource is expected to facilitate large-scale studies of group I introns, as well as engineering for novel applications.
Availability and implementation: The database, as well as the code to generate it and a GUI to facilitate its exploration, are available at https://github.com/LaraSellesVidal/Group1IntronDatabase. The source code for the GUI implementation is available at https://github.com/LaraSellesVidal/OnlineGroup1IntronDatabase. The database can also be accessed online at https://online-group-1-intron-database.onrender.com. Base-pairing probability matrixes are available separately at https://www.ebi.ac.uk/biostudies/studies/S-BSST1399.
Bioinformatics advances, vol. 5, no. 1, vbaf020. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835236/pdf/
Citations: 0
Sequali: efficient and comprehensive quality control of short- and long-read sequencing data.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-29 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf010
Ruben H P Vorderman
Abstract
Motivation: Quality control of sequencing data is the first step in many sequencing workflows. Short- and long-read sequencing technologies have many commonalities with regard to quality control. Several quality control programs exist; however, none possess a feature set that is adequate for both technologies. Quality control programs aimed at Oxford Nanopore Technologies sequencing lack vital features, such as adapter searching, overrepresented sequence analysis, and duplication analysis.
Results: Sequali was developed to provide sequencing quality control for both short- and long-read sequencing technologies. It features adapter search, overrepresented sequence analysis, and duplication analysis and supports FASTQ and uBAM inputs. It is significantly faster than comparable sequencing quality control programs for both short- and long-read sequencing technologies.
Availability and implementation: Sequali is an open-source Python application using C extensions and is freely available under the AGPL-3.0 license at https://github.com/rhpvorderman/sequali. The source code for each release is archived at Zenodo: https://zenodo.org/doi/10.5281/zenodo.10822485.
Bioinformatics advances, vol. 5, no. 1, vbaf010. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11802474/pdf/
Citations: 0
A bioinformatician, computer scientist, and geneticist lead bioinformatic tool development-which one is better?
IF 2.4
Bioinformatics advances Pub Date : 2025-01-29 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf011
Paul P Gardner
Abstract
Motivation: The development of accurate bioinformatic software tools is crucial for the effective analysis of complex biological data. This study examines the relationship between the academic department affiliations of authors and the accuracy of the bioinformatic tools they develop. By analyzing a corpus of previously benchmarked bioinformatic software tools, we mapped bioinformatic tools to the academic fields of the corresponding authors and evaluated tool accuracy by field.
Results: Our results suggest that "Medical Informatics" outperforms all other fields in bioinformatic software accuracy, with a mean proportion of wins in accuracy rankings exceeding the null expectation. In contrast, tools developed by authors affiliated with "Bioinformatics" and "Engineering" fields tend to be less accurate. However, after correcting for multiple testing, no result is statistically significant (P > .05). Our findings reveal no strong association between academic field and bioinformatic software accuracy. These findings suggest that the development of interdisciplinary software applications can be effectively undertaken by any department with sufficient resources and training.
Availability and implementation: All data and the analysis pipeline for this study are freely available online at the GitHub repository: https://github.com/ppgardne/departments-software-accuracy.
Bioinformatics advances, vol. 5, no. 1, vbaf011. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11842046/pdf/
Citations: 0
Editome Disease Knowledgebase v2.0: an updated resource of editome-disease associations through literature curation and integrative analysis.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-25 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf012
Tongtong Zhu, Yuan Chu, Guangyi Niu, Rong Pan, Ming Chen, Yuanyuan Cheng, Yuansheng Zhang, Zhao Li, Shuai Jiang, Lili Hao, Dong Zou, Tianyi Xu, Zhang Zhang
Abstract
Motivation: Editome Disease Knowledgebase (EDK) is a curated resource of knowledge between RNA editome and human diseases. Since its first release in 2018, a number of studies have discovered previously uncharacterized editome-disease associations and generated an abundance of RNA editing datasets. Thus, it is desirable to make significant updates for EDK by incorporating more editome-disease associations as well as their related editing profiles.
Results: Here, we present EDK v2.0, an updated version of editome-disease associations based on both literature curation and integrative analysis. EDK v2.0 incorporates a curated collection of 1097 editome-disease associations involving 115 diseases from 321 publications. Meanwhile, based on a standardized pipeline, EDK v2.0 provides RNA editing profiles from 48 datasets covering 2536 samples across 55 diseases. Through differential analysis on RNA editing, it further identifies a total of 7190 differentially edited genes and 86 242 differential editing sites (DESs), leading to 266 339 DES-disease associations. Moreover, a curated list of 28 160 cis-RNA editing QTL associations, 458 187 DES-RNA binding protein associations, and 21 DES-RNA secondary structure associations are annotated and added to EDK v2.0. Additionally, it is equipped with a series of user-friendly tools to facilitate RNA editing online analysis.
Availability and implementation: https://ngdc.cncb.ac.cn/edk/.
Bioinformatics advances, vol. 5, no. 1, vbaf012. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835235/pdf/
Citations: 0
OrthoBrowser: gene family analysis and visualization.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-23 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf009
Nolan T Hartwick, Todd P Michael
Abstract
Motivation: The analysis of gene families across diverse species is pivotal in elucidating evolutionary dynamics and functional genomic landscapes. Typical analysis approaches often require significant computational expertise and user time.
Results: We introduce OrthoBrowser, a static site generator that will index and serve phylogeny, gene trees, multiple sequence alignments, and novel multiple synteny alignments. This greatly enhances the usability of tools like OrthoFinder by making the detailed results much more visually accessible. This interface can scale reasonably up to hundreds of genomes, allows a user to filter this large dataset to a subset of samples they are interested in at that particular moment in time, or "zoom in" to only a subtree of the orthogroup. The multiple synteny alignment method uses a progressive hierarchical alignment approach in the protein space using orthogroup membership to establish orthology. OrthoBrowser makes it easy for users to identify, interact with, explore, and share key information about their gene families of interest.
Availability and implementation: OrthoBrowser is pip installable and is available under MIT license at: https://gitlab.com/salk-tm/orthobrowser. Complete example OrthoBrowser results are available at: https://orthobrowserexamples.netlify.app/.
Bioinformatics advances, vol. 5, no. 1, vbaf009. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11825985/pdf/
Citations: 0
AOP-networkFinder-a versatile tool for the reconstruction and visualization of Adverse Outcome Pathway networks from AOP-Wiki.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-22 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf007
Nurettin Yarar, Marvin Martens, Torbjørn Rognes, Jan Lavender, Hubert Dirven, Karine Audouze, Marcin W Wojewodzic
Abstract
Motivation: The Adverse Outcome Pathways (AOP)-Wiki, a knowledge database for AOPs, requires an efficient way to present an overview of its content for the reconstruction of networks by experts in a given domain. We have developed the AOP-networkFinder, a user-friendly tool that retrieves AOPs of interest, allows network generation and cleaning, and finally visualizes networks built around the retrieved AOPs. Our tool constructs AOP networks by connecting AOPs that use the same Key Events (KEs) in a versatile but controlled manner. Genes related to these KEs are also displayed. The constructed networks can then be exported as images or to Cytoscape for further fine-tuning and statistical analysis.
Results: The AOP-networkFinder allows users to comprehensively identify relationships between KEs and visualize the overall structure of an AOP both quickly and easily. This is immensely beneficial to researchers who need to understand the complex interplay between different KEs and the overall pathway they are studying and helps them to build further networks of interest while logging relevant information about changes within the network. These efforts are in line with the Findable, Accessible, Interoperable, and Reusable principles, which are crucial attributes for any developed databases and tools for optimizing (re)use in a dynamically changing landscape of AOP-Wiki.
Availability and implementation: The AOP-networkFinder is an open-source application and is available online at aop-networkfinder.no, in the 'Computational Toxicology at Norwegian Institute of Public Health' Zenodo community at DOI 10.5281/zenodo.11068434, in the GitHub repository at github.com/folkehelseinstituttet/AOPnetworkFinder_v1, as well as in a Docker image at hub.docker.com/r/nurre123/aop_network_finder. The software is available under the GNU Affero General Public License (AGPL), v3.0. The tool uses the AOP-Wiki SPARQL endpoint to retrieve AOP data.
Bioinformatics advances, vol. 5, no. 1, vbaf007. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835234/pdf/
Citations: 0
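The AOP-networkFinder abstract above says the tool builds networks by connecting AOPs that use the same Key Events. The sketch below illustrates only that general idea on made-up data; it is not the tool's code. The function name `build_aop_network` and the identifiers `AOP-A`, `KE1`, and so on are hypothetical, and the real tool retrieves its data from the AOP-Wiki SPARQL endpoint rather than from a hand-written dictionary.

```python
from collections import defaultdict
from itertools import combinations


def build_aop_network(aop_key_events):
    """Link every pair of AOPs that share at least one Key Event (KE).

    aop_key_events maps an AOP identifier to the set of KE identifiers
    it uses. The result maps each sorted (aop_a, aop_b) pair to the set
    of KEs the two AOPs have in common.
    """
    # Invert the mapping: for each KE, collect the AOPs that use it.
    ke_to_aops = defaultdict(set)
    for aop, key_events in aop_key_events.items():
        for ke in key_events:
            ke_to_aops[ke].add(aop)

    # Any two AOPs listed under the same KE get an edge labelled with it.
    edges = defaultdict(set)
    for ke, aops in ke_to_aops.items():
        for aop_a, aop_b in combinations(sorted(aops), 2):
            edges[(aop_a, aop_b)].add(ke)
    return dict(edges)


# Hypothetical toy input, invented for illustration only.
toy_aops = {
    "AOP-A": {"KE1", "KE2"},
    "AOP-B": {"KE2", "KE3"},
    "AOP-C": {"KE4"},
}
network = build_aop_network(toy_aops)
```

In this toy input only AOP-A and AOP-B share a Key Event, so the network holds a single edge and AOP-C stays unconnected. Inverting the mapping first avoids comparing every pair of AOPs directly.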
Imputation for Lipidomics and Metabolomics (ImpLiMet): a web-based application for optimization and method selection for missing data imputation.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-21 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbae209
Huiting Ou, Anuradha Surendra, Graeme S V McDowell, Emily Hashimoto-Roth, Jianguo Xia, Steffany A L Bennett, Miroslava Čuperlović-Culf
Abstract
Motivation: Missing values are prevalent in high-throughput measurements due to various experimental or analytical reasons. Imputation, the process of replacing missing values in a dataset with estimated values, plays an important role in multivariate and machine learning analyses. The three missingness patterns, including missing completely at random, missing at random, and missing not at random, describe unique dependencies between the missing and observed data. The optimal imputation method for each dataset depends on the type of data, the cause of the missingness, and the nature of relationships between the missing and observed data. The challenge is to identify the optimal imputation solution for a given dataset.
Results: ImpLiMet is a user-friendly web platform that enables users to impute missing data using eight different methods. For a given dataset, ImpLiMet suggests the optimal imputation solution through a grid search-based investigation of the error rate for imputation across three missingness data simulations. The effect of imputation can be visually assessed by histogram, kurtosis, and skewness, as well as principal component analysis comparing the impact of the chosen imputation method on the distribution and overall behavior of the data.
Availability and implementation: ImpLiMet is freely available at https://complimet.ca/shiny/implimet/ and https://github.com/complimet/ImpLiMet.
Bioinformatics advances, vol. 5, no. 1, vbae209. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11761345/pdf/
Citations: 0
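The ImpLiMet abstract above describes choosing an imputation method by simulating missingness and comparing imputation error. The sketch below shows one simple version of that idea: hide some known values, impute them, and measure how far the estimates fall from the truth. It is not ImpLiMet's implementation. It assumes only a missing-completely-at-random simulation and three simple fill rules (mean, median, half-minimum) chosen for illustration, whereas the tool itself offers eight methods and three missingness simulations. All function names and the sample data are hypothetical.

```python
import random
import statistics


def impute_with(values, fill):
    """Replace None entries with fill(observed values)."""
    observed = [v for v in values if v is not None]
    estimate = fill(observed)
    return [estimate if v is None else v for v in values]


# Illustrative fill rules only; not the eight methods offered by ImpLiMet.
METHODS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "half_minimum": lambda xs: min(xs) / 2,
}


def masked_error(values, fill, fraction=0.2, seed=0):
    """Hide a random fraction of known values, impute, return mean abs error."""
    rng = random.Random(seed)
    known = [i for i, v in enumerate(values) if v is not None]
    hidden = rng.sample(known, max(1, int(len(known) * fraction)))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    imputed = impute_with(masked, fill)
    return statistics.mean(abs(imputed[i] - values[i]) for i in hidden)


def best_method(values, **kwargs):
    """Score every method on the same masked positions and pick the lowest error."""
    errors = {name: masked_error(values, fill, **kwargs)
              for name, fill in METHODS.items()}
    return min(errors, key=errors.get), errors


# Hypothetical measurements with two genuinely missing entries.
data = [4.0, 5.0, None, 6.0, 5.5, 4.5, None, 5.0, 30.0, 5.2]
name, errors = best_method(data)
```

Every method is scored with the same seed, so all of them are judged on the same hidden positions. A fuller comparison would repeat the masking over many seeds and missingness patterns before trusting the ranking.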
Accurate prediction of nucleic acid binding proteins using protein language model.
IF 2.4
Bioinformatics advances Pub Date : 2025-01-20 eCollection Date: 2025-01-01 DOI: 10.1093/bioadv/vbaf008
Siwen Wu, Jinbo Xu, Jun-Tao Guo
Abstract
Motivation: Nucleic acid binding proteins (NABPs) play critical roles in various and essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these studies have limited applications in predicting the types of NABPs for any given protein with unknown functions, due to several factors such as dataset construction, prediction scope and features used for training and testing. In addition, single-stranded DNA binding proteins (DBPs), known as SSBs, have not been extensively investigated for identifying novel SSBs from proteins with unknown functions.
Results: To improve prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models with machine learning-based methods and a feature extracted from protein language model ESM2. Our results show that by combining the feature from ESM2 and machine learning methods, we can achieve high prediction accuracy up to 95% for each stage in the hierarchical approach, and 85% for overall prediction accuracy from the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict single-stranded DBPs, a class that remains underexplored.
Availability and implementation: The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
Bioinformatics advances, vol. 5, no. 1, vbaf008. Journal article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11845279/pdf/
Citations: 0