Nicolas Haas, Julie Dawn Thompson, Jean-Paul Renaud, Kirsley Chennen, Olivier Poch
{"title":"StopKB: a comprehensive knowledgebase for nonsense suppression therapies.","authors":"Nicolas Haas, Julie Dawn Thompson, Jean-Paul Renaud, Kirsley Chennen, Olivier Poch","doi":"10.1093/database/baae108","DOIUrl":"10.1093/database/baae108","url":null,"abstract":"<p><p>Nonsense variations, characterized by premature termination codons, play a major role in human genetic diseases as well as in cancer susceptibility. Despite their high prevalence, effective therapeutic strategies targeting premature termination codons remain a challenge. To understand and explore the intricate mechanisms involved, we developed StopKB, a comprehensive knowledgebase aggregating data from multiple sources on nonsense variations, associated genes, diseases, and phenotypes. StopKB identifies 637 317 unique nonsense variations, distributed across 18 022 human genes and linked to 3206 diseases and 7765 phenotypes. Notably, ∼32% of these variations are classified as nonsense-mediated mRNA decay-insensitive, potentially representing suitable targets for nonsense suppression therapies. We also provide an interactive web interface to facilitate efficient and intuitive data exploration, enabling researchers and clinicians to navigate the complex landscape of nonsense variations. StopKB represents a valuable resource for advancing research in precision medicine and more specifically, the development of targeted therapeutic interventions for genetic diseases associated with nonsense variations. 
Database URL: https://lbgi.fr/stopkb/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11470752/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142460045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The relational modeling of hierarchical data in biodiversity databases.","authors":"Petr Novotný, Jan Wild","doi":"10.1093/database/baae107","DOIUrl":"10.1093/database/baae107","url":null,"abstract":"<p><p>The unifying element of all biodiversity data is the issue of taxon hierarchy modeling. We compared 25 existing databases in terms of how they handle the taxon hierarchy and present these data. We used documentation or demo installations of the databases as the primary source of information, followed by analysis of their structures using R packages provided by the inspected platforms. If neither of these was available, we used the public interface of the individual databases. For almost half (12) of the databases analyzed, we did not find any formalized taxon hierarchy data structure; these databases provide only biological information about taxon membership in higher ranks, which is not fully formalizable and thus not generally usable. The least effective Adjacency List model (storing the parentId of a taxon) dominates among the remaining providers. This study demonstrates the lack of attention paid by current biodiversity databases to modeling taxon hierarchy, particularly to making it available to researchers in the form of a hierarchical data structure within the data provided. For biodiversity relational databases, the Closure Table type is the most suitable of the known data models, which also corresponds to the ontology concept. 
However, its use is rather sporadic within the biodiversity databases ecosystem.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11466226/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142399684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini.","authors":"Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang","doi":"10.1093/database/baae104","DOIUrl":"10.1093/database/baae104","url":null,"abstract":"<p><p>Despite numerous efforts by teams participating in BioCreative VIII Track 01, which employed various techniques to achieve high accuracy on biomedical relation tasks, overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. 
Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463225/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142388766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII.","authors":"","doi":"10.1093/database/baae110","DOIUrl":"https://doi.org/10.1093/database/baae110","url":null,"abstract":"","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142364755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan
{"title":"Automated annotation of scientific texts for ML-based keyphrase extraction and validation.","authors":"Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan","doi":"10.1093/database/baae093","DOIUrl":"https://doi.org/10.1093/database/baae093","url":null,"abstract":"<p><p>Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. 
The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142343320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPMKG: a condition-based knowledge graph for precision medicine.","authors":"Jiaxin Yang, Xinhao Zhuang, Zhenqi Li, Gang Xiong, Ping Xu, Yunchao Ling, Guoqing Zhang","doi":"10.1093/database/baae102","DOIUrl":"https://doi.org/10.1093/database/baae102","url":null,"abstract":"<p><p>Personalized medicine tailors treatments and dosages based on a patient's unique characteristics, particularly their genetic profile. Over the decades, stratified research and clinical trials have uncovered crucial drug-related information (such as dosage, effectiveness, and side effects) affecting specific individuals with particular genetic backgrounds. This genetic-specific knowledge, characterized by complex multirelationships and conditions, cannot be adequately represented or stored in conventional knowledge systems. To address these challenges, we developed CPMKG, a condition-based platform that enables comprehensive knowledge representation. Through information extraction and meticulous curation, we compiled 307 614 knowledge entries, encompassing thousands of drugs, diseases, phenotypes (complications/side effects), genes, and genomic variations across four key categories: drug side effects, drug sensitivity, drug mechanisms, and drug indications. CPMKG facilitates drug-centric exploration and enables condition-based multiknowledge inference, accelerating knowledge discovery through three pivotal applications. To enhance user experience, we seamlessly integrated a sophisticated large language model that provides textual interpretations for each subgraph, bridging the gap between structured graphs and language expressions. With its comprehensive knowledge graph and user-centric applications, CPMKG serves as a valuable resource for clinical research, offering drug information tailored to personalized genetic profiles, syndromes, and phenotypes. 
Database URL: https://www.biosino.org/cpmkg/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11429523/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142343321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PeptiHub: a curated repository of precisely annotated cancer-related peptides with advanced utilities for peptide exploration and discovery.","authors":"Sara Zareei, Babak Khorsand, Alireza Dantism, Neda Zareei, Fereshteh Asgharzadeh, Shadi Shams Zahraee, Samane Mashreghi Kashan, Shirin Hekmatirad, Shila Amini, Fatemeh Ghasemi, Maryam Moradnia, Atena Vaghf, Anahid Hemmatpour, Hamdam Hourfar, Soudabeh Niknia, Ali Johari, Fatemeh Salimi, Neda Fariborzi, Zohreh Shojaei, Elaheh Asiaei, Hossein Shabani","doi":"10.1093/database/baae092","DOIUrl":"10.1093/database/baae092","url":null,"abstract":"<p><p>PeptiHub (https://bioinformaticscollege.ir/peptihub/) is a meticulously curated repository of cancer-related peptides (CRPs) that have been documented in scientific literature. A diverse collection of CRPs is included in PeptiHub, showcasing a spectrum of effects and activities. While some peptides demonstrated significant anticancer efficacy, others exhibited no discernible impact, and some even possessed alternative non-drug functionalities, including drug carrier or carcinogenic attributes. Presently, PeptiHub houses 874 CRPs, subjected to evaluation across 10 distinct organism categories, 26 organs, and 438 cell lines. Each entry in the database is accompanied by easily accessible 3D conformations, obtained either experimentally or through predictive methodology. Users are provided with three search frameworks offering basic, advanced, and BLAST sequence search options. Furthermore, precise annotations of peptides enable users to explore CRPs based on their specific activities (anticancer, no effect, insignificant effect, carcinogen, and others) and their effectiveness (rate and IC50) under cancer conditions, specifically within individual organs. This unique property facilitates the construction of robust training and testing datasets. 
Additionally, PeptiHub offers 1141 features, from which users can select those most pertinent to their specific research questions. Features include aaindex1 (in six main subcategories: alpha propensities, beta propensity, composition indices, hydrophobicity, physicochemical properties, and other properties), amino acid composition (Amino acid Composition and Dipeptide Composition), and Grouped Amino Acid Composition (Grouped amino acid composition, Grouped dipeptide composition, and Conjoint triad) categories. These utilities not only speed up machine learning-based peptide design but also facilitate peptide classification. Database URL: https://bioinformaticscollege.ir/peptihub/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417155/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neha, Jesu Castin, Saman Fatihi, Deepanshi Gahlot, Akanksha Arun, Lipi Thukral
{"title":"Autophagy3D: a comprehensive autophagy structure database.","authors":"Neha, Jesu Castin, Saman Fatihi, Deepanshi Gahlot, Akanksha Arun, Lipi Thukral","doi":"10.1093/database/baae088","DOIUrl":"10.1093/database/baae088","url":null,"abstract":"<p><p>The autophagy pathway plays a central role in cellular degradation. The proteins involved in the core autophagy process are mostly localised on membranes or interact indirectly with lipid-associated proteins. Therefore, progress in structure determination of 'core autophagy proteins' has remained relatively limited. A recent paradigm shift in structural biology, which includes cutting-edge cryo-EM technology and robust AI-based AlphaFold2 predicted models, has significantly increased the number of data points in biology. Here, we developed Autophagy3D, a web-based resource that provides an efficient way to access data associated with 40 core human autophagic proteins (80322 structures), their protein-protein interactors and ortholog structures from various species. Autophagy3D also offers detailed visualizations of protein structures, helping users derive direct biological insights. The database significantly enhances access to information as full datasets are available for download. Autophagy3D can be publicly accessed via https://autophagy3d.igib.res.in. 
Database URL: https://autophagy3d.igib.res.in.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11412239/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki
{"title":"Influenza sequence validation and annotation using VADR.","authors":"Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki","doi":"10.1093/database/baae091","DOIUrl":"10.1093/database/baae091","url":null,"abstract":"<p><p>Tens of thousands of influenza sequences are deposited into the GenBank database each year. The software tool FLu ANnotation tool (FLAN) has been used by GenBank since 2007 to validate and annotate incoming influenza sequence submissions and has been publicly available as a webserver but not as a standalone tool. Viral Annotation DefineR (VADR) is a general sequence validation and annotation software package used by GenBank for norovirus, dengue virus and SARS-CoV-2 virus sequence processing that is available as a standalone tool. We have created VADR influenza models based on the FLAN reference sequences and adapted VADR to accurately annotate influenza sequences. VADR and FLAN show consistent results on the vast majority of influenza sequences, and when they disagree, VADR is usually correct. VADR can also accurately process influenza D sequences as well as influenza A H17, H18, H19, N10 and N11 subtype sequences, which FLAN cannot. VADR 1.6.3 and the associated influenza models are now freely available for users to download and use. 
Database URL: https://bitbucket.org/nawrockie/vadr-models-flu.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11411204/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Pan, Zijing Gao, Xuejian Cui, Zhen Li, Rui Jiang
{"title":"collectNET: a web server for integrated inference of cell-cell communication network.","authors":"Yan Pan, Zijing Gao, Xuejian Cui, Zhen Li, Rui Jiang","doi":"10.1093/database/baae098","DOIUrl":"https://doi.org/10.1093/database/baae098","url":null,"abstract":"<p><p>Cell-cell communication (CCC) through ligand-receptor (L-R) pairs forms the cornerstone for complex functionalities in multicellular organisms. Deciphering such intercellular signaling can contribute to unraveling disease mechanisms and enable targeted therapy. Nonetheless, notable biases and inconsistencies are evident among the inferential outcomes generated by current methods for inferring CCC network. To fill this gap, we developed collectNET (http://health.tsinghua.edu.cn/collectnet) as a comprehensive web platform for analyzing CCC network, with efficient calculation, hierarchical browsing, comprehensive statistics, advanced searching, and intuitive visualization. collectNET provides a reliable online inference service with prior knowledge of three public L-R databases and systematic integration of three mainstream inference methods. Additionally, collectNET has assembled a human CCC atlas, including 126 785 significant communication pairs based on 343 023 cells. We anticipate that collectNET will benefit researchers in gaining a more holistic understanding of cell development and differentiation mechanisms. 
Database URL: http://health.tsinghua.edu.cn/collectnet.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403813/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}