{"title":"The relational modeling of hierarchical data in biodiversity databases.","authors":"Petr Novotný, Jan Wild","doi":"10.1093/database/baae107","DOIUrl":"10.1093/database/baae107","url":null,"abstract":"<p><p>The unifying element of all biodiversity data is the issue of taxon hierarchy modeling. We compared 25 existing databases in terms of handling taxa hierarchy and presentation of this data. We used documentation or demo installations of databases as a source of information and next in line was the analysis of structures using R packages provided by inspected platforms. If neither of these was available, we used the public interface of individual databases. For almost half (12) of the databases analyzed, we did not find any formalized taxa hierarchy data structure, providing only biological information about taxon membership in higher ranks, which is not fully formalizable and thus not generally usable. The least effective Adjacency List model (storing parentId of a taxon) dominates among the remaining providers. This study demonstrates the lack of attention paid by current biodiversity databases to modeling taxon hierarchy, particularly to making it available to researchers in the form of a hierarchical data structure within the data provided. For biodiversity relational databases, the Closure Table type is the most suitable of the known data models, which also corresponds to the ontology concept. However, its use is rather sporadic within the biodiversity databases ecosystem.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11466226/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142399684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini.","authors":"Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang","doi":"10.1093/database/baae104","DOIUrl":"10.1093/database/baae104","url":null,"abstract":"<p><p>Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11463225/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142388766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII.","authors":"","doi":"10.1093/database/baae110","DOIUrl":"https://doi.org/10.1093/database/baae110","url":null,"abstract":"","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142364755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPMKG: a condition-based knowledge graph for precision medicine.","authors":"Jiaxin Yang, Xinhao Zhuang, Zhenqi Li, Gang Xiong, Ping Xu, Yunchao Ling, Guoqing Zhang","doi":"10.1093/database/baae102","DOIUrl":"https://doi.org/10.1093/database/baae102","url":null,"abstract":"<p><p>Personalized medicine tailors treatments and dosages based on a patient's unique characteristics, particularly its genetic profile. Over the decades, stratified research and clinical trials have uncovered crucial drug-related information-such as dosage, effectiveness, and side effects-affecting specific individuals with particular genetic backgrounds. This genetic-specific knowledge, characterized by complex multirelationships and conditions, cannot be adequately represented or stored in conventional knowledge systems. To address these challenges, we developed CPMKG, a condition-based platform that enables comprehensive knowledge representation. Through information extraction and meticulous curation, we compiled 307 614 knowledge entries, encompassing thousands of drugs, diseases, phenotypes (complications/side effects), genes, and genomic variations across four key categories: drug side effects, drug sensitivity, drug mechanisms, and drug indications. CPMKG facilitates drug-centric exploration and enables condition-based multiknowledge inference, accelerating knowledge discovery through three pivotal applications. To enhance user experience, we seamlessly integrated a sophisticated large language model that provides textual interpretations for each subgraph, bridging the gap between structured graphs and language expressions. With its comprehensive knowledge graph and user-centric applications, CPMKG serves as a valuable resource for clinical research, offering drug information tailored to personalized genetic profiles, syndromes, and phenotypes. Database URL: https://www.biosino.org/cpmkg/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11429523/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142343321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan
{"title":"Automated annotation of scientific texts for ML-based keyphrase extraction and validation.","authors":"Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan","doi":"10.1093/database/baae093","DOIUrl":"https://doi.org/10.1093/database/baae093","url":null,"abstract":"<p><p>Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142343320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PeptiHub: a curated repository of precisely annotated cancer-related peptides with advanced utilities for peptide exploration and discovery.","authors":"Sara Zareei, Babak Khorsand, Alireza Dantism, Neda Zareei, Fereshteh Asgharzadeh, Shadi Shams Zahraee, Samane Mashreghi Kashan, Shirin Hekmatirad, Shila Amini, Fatemeh Ghasemi, Maryam Moradnia, Atena Vaghf, Anahid Hemmatpour, Hamdam Hourfar, Soudabeh Niknia, Ali Johari, Fatemeh Salimi, Neda Fariborzi, Zohreh Shojaei, Elaheh Asiaei, Hossein Shabani","doi":"10.1093/database/baae092","DOIUrl":"10.1093/database/baae092","url":null,"abstract":"<p><p>Peptihub (https://bioinformaticscollege.ir/peptihub/) is a meticulously curated repository of cancer-related peptides (CRPs) that have been documented in scientific literature. A diverse collection of CRPs is included in the PeptiHub, showcasing a spectrum of effects and activities. While some peptides demonstrated significant anticancer efficacy, others exhibited no discernible impact, and some even possessed alternative non-drug functionalities, including drug carrier or carcinogenic attributes. Presently, Peptihub houses 874 CRPs, subjected to evaluation across 10 distinct organism categories, 26 organs, and 438 cell lines. Each entry in the database is accompanied by easily accessible 3D conformations, obtained either experimentally or through predictive methodology. Users are provided with three search frameworks offering basic, advanced, and BLAST sequence search options. Furthermore, precise annotations of peptides enable users to explore CRPs based on their specific activities (anticancer, no effect, insignificant effect, carcinogen, and others) and their effectiveness (rate and IC50) under cancer conditions, specifically within individual organs. This unique property facilitates the construction of robust training and testing datasets. Additionally, PeptiHub offers 1141 features with the convenience of selecting the most pertinent features to address their specific research questions. Features include aaindex1 (in six main subcategories: alpha propensities, beta propensity, composition indices, hydrophobicity, physicochemical properties, and other properties), amino acid composition (Amino acid Composition and Dipeptide Composition), and Grouped Amino Acid Composition (Grouped amino acid composition, Grouped dipeptide composition, and Conjoint triad) categories. These utilities not only speed up machine learning-based peptide design but also facilitate peptide classification. Database URL: https://bioinformaticscollege.ir/peptihub/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11417155/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neha, Jesu Castin, Saman Fatihi, Deepanshi Gahlot, Akanksha Arun, Lipi Thukral
{"title":"Autophagy3D: a comprehensive autophagy structure database.","authors":"Neha, Jesu Castin, Saman Fatihi, Deepanshi Gahlot, Akanksha Arun, Lipi Thukral","doi":"10.1093/database/baae088","DOIUrl":"10.1093/database/baae088","url":null,"abstract":"<p><p>Autophagy pathway plays a central role in cellular degradation. The proteins involved in the core autophagy process are mostly localised on membranes or interact indirectly with lipid-associated proteins. Therefore, progress in structure determination of 'core autophagy proteins' remained relatively limited. Recent paradigm shift in structural biology that includes cutting-edge cryo-EM technology and robust AI-based Alphafold2 predicted models has significantly increased data points in biology. Here, we developed Autophagy3D, a web-based resource that provides an efficient way to access data associated with 40 core human autophagic proteins (80322 structures), their protein-protein interactors and ortholog structures from various species. Autophagy3D also offers detailed visualizations of protein structures, and, hence deriving direct biological insights. The database significantly enhances access to information as full datasets are available for download. The Autophagy3D can be publicly accessed via https://autophagy3d.igib.res.in. Database URL: https://autophagy3d.igib.res.in.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11412239/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki
{"title":"Influenza sequence validation and annotation using VADR.","authors":"Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki","doi":"10.1093/database/baae091","DOIUrl":"10.1093/database/baae091","url":null,"abstract":"<p><p>Tens of thousands of influenza sequences are deposited into the GenBank database each year. The software tool FLu ANnotation tool (FLAN) has been used by GenBank since 2007 to validate and annotate incoming influenza sequence submissions and has been publicly available as a webserver but not as a standalone tool. Viral Annotation DefineR (VADR) is a general sequence validation and annotation software package used by GenBank for norovirus, dengue virus and SARS-CoV-2 virus sequence processing that is available as a standalone tool. We have created VADR influenza models based on the FLAN reference sequences and adapted VADR to accurately annotate influenza sequences. VADR and FLAN show consistent results on the vast majority of influenza sequences, and when they disagree, VADR is usually correct. VADR can also accurately process influenza D sequences as well as influenza A H17, H18, H19, N10 and N11 subtype sequences, which FLAN cannot. VADR 1.6.3 and the associated influenza models are now freely available for users to download and use. Database URL: https://bitbucket.org/nawrockie/vadr-models-flu.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11411204/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Pan, Zijing Gao, Xuejian Cui, Zhen Li, Rui Jiang
{"title":"collectNET: a web server for integrated inference of cell-cell communication network.","authors":"Yan Pan, Zijing Gao, Xuejian Cui, Zhen Li, Rui Jiang","doi":"10.1093/database/baae098","DOIUrl":"https://doi.org/10.1093/database/baae098","url":null,"abstract":"<p><p>Cell-cell communication (CCC) through ligand-receptor (L-R) pairs forms the cornerstone for complex functionalities in multicellular organisms. Deciphering such intercellular signaling can contribute to unraveling disease mechanisms and enable targeted therapy. Nonetheless, notable biases and inconsistencies are evident among the inferential outcomes generated by current methods for inferring CCC network. To fill this gap, we developed collectNET (http://health.tsinghua.edu.cn/collectnet) as a comprehensive web platform for analyzing CCC network, with efficient calculation, hierarchical browsing, comprehensive statistics, advanced searching, and intuitive visualization. collectNET provides a reliable online inference service with prior knowledge of three public L-R databases and systematic integration of three mainstream inference methods. Additionally, collectNET has assembled a human CCC atlas, including 126 785 significant communication pairs based on 343 023 cells. We anticipate that collectNET will benefit researchers in gaining a more holistic understanding of cell development and differentiation mechanisms. Database URL: http://health.tsinghua.edu.cn/collectnet.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403813/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analysis of FRE @ BC8 SympTEMIST track: named entity recognition.","authors":"Ander Martinez, Nuria García-Santa","doi":"10.1093/database/baae101","DOIUrl":"https://doi.org/10.1093/database/baae101","url":null,"abstract":"<p><p>This paper is a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition Zenodo.) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated on the challenge submitting two systems based on a RoBERTa architecture LLM trained on Spanish-language clinical data available at 'HuggingFace' model repository. Before choosing the systems that would be submitted, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze more in depth our methods, as well as measuring the impact of introducing data from CARMEN-I (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo) corpus. Our experiments show the moderate effect of using the Sub-Subword feature based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":null,"pages":null},"PeriodicalIF":3.4,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11403810/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142281701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}