Ara Monadjem, Richard C Boycott, Thea Litscha-Koen, Adam Kane, Wisdom M Dlamini, Lindelwa Mmema, Katharine L Strutton, Zakhele Hlophe, Sara Padidar
{"title":"A database on the historical and current occurrences of snakes in Eswatini.","authors":"Ara Monadjem, Richard C Boycott, Thea Litscha-Koen, Adam Kane, Wisdom M Dlamini, Lindelwa Mmema, Katharine L Strutton, Zakhele Hlophe, Sara Padidar","doi":"10.1093/database/baaf040","DOIUrl":"https://doi.org/10.1093/database/baaf040","url":null,"abstract":"<p><p>Snakes are among the most difficult terrestrial vertebrates to survey, resulting in poor distributional information on most species. This database comprises 3812 records of 58 species of snakes in 37 genera reported from within the boundaries of Eswatini. The data were compiled from multiple sources including museum specimens, iNaturalist records, literature records, and snake rescue operations. For each specimen reported in the database, we provide the scientific name, latitude and longitude coordinates, and location. Most records also have an associated date. This comprehensive database will be useful to biodiversity experts, conservationists, medical practitioners, researchers, and snake enthusiasts, especially for mapping and modelling snake distributions in the country. To allow easy viewing of the distribution of snakes in the country, we provide an online visualization tool, which should allow a greater number of non-scientists to utilize this database.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144583337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: GymnoTOA-db: a database and application to optimize functional annotation in gymnosperms.","authors":"","doi":"10.1093/database/baaf041","DOIUrl":"https://doi.org/10.1093/database/baaf041","url":null,"abstract":"","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144583336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evgeniia M Maksiutenko, Igor V Bezdvornykh, Yury A Barbitoff, Yulia A Nasykhova, Andrey S Glotov
{"title":"PLoV: a comprehensive database of genetic variants leading to pregnancy loss.","authors":"Evgeniia M Maksiutenko, Igor V Bezdvornykh, Yury A Barbitoff, Yulia A Nasykhova, Andrey S Glotov","doi":"10.1093/database/baaf037","DOIUrl":"https://doi.org/10.1093/database/baaf037","url":null,"abstract":"<p><p>Pregnancy loss is an important reproductive health problem that affects many couples. Genetic factors play an important role in both spontaneous miscarriage and recurrent pregnancy loss, and the effect of genomic variants is recognized as one of the major causes of pregnancy loss in euploid foetuses. In this work, we extend our previous analysis of the genetic landscape of pregnancy loss and develop a Pregnancy Loss genetic Variant (PLoV) database to aggregate information about mutations that have been implicated in pregnancy loss. The database contains information about 534 genetic variants that have been observed in 421 cases across 47 studies, including foetus-only, parent-only, and trio-based studies. For each case, the database includes a detailed description of the phenotype, including ultrasound data (if provided in the original article). The genetic variants are scattered across all chromosomes in the human genome and affect a total of 292 unique genes. We provide public access to the PLoV database at https://plovdb.ott.ru/. 
Database URL: https://plovdb.ott.ru/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144583339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BgDB: a comprehensive genomic resource information system of bitter gourd for accelerated breeding programme.","authors":"Princy Saini, Ankita Singh, Tilak Chandra, Dheeraj Kumar Chaurasia, Kunal Chaudhary, Priyanka Jain, G Boopalakrishnan, Sarika Jaiswal, Shyam Sunder Dey, Tusar Kanti Behera, Ulavappa Basavanneppa Angadi, Mir Asif Iquebal, Dinesh Kumar","doi":"10.1093/database/baaf039","DOIUrl":"https://doi.org/10.1093/database/baaf039","url":null,"abstract":"<p><p>Bitter gourd, scientifically known as Momordica charantia L. with 2n = 22, is a widely recognized medicinal vegetable, renowned for its multifaceted health benefits, primarily acclaimed for its lipid- and glucose-lowering effects. Growing demand for it as a food source and for industrial applications necessitates value addition in ongoing breeding initiatives to enhance genotypic traits in multifarious ways. A thorough understanding of the underlying molecular footprint is warranted for characterization, which remains underexplored relative to other cash crops. Though a chromosome-level genome assembly of bitter gourd is available, scattered and fragmented information is an obstacle to assisted breeding and gene editing. Therefore, it is crucial to further dissect structural and molecular variants, noncoding RNAs (ncRNAs), transcription factors, and transcripts from whole-genome and resequencing projects. The present study leads to the development of a comprehensive genomic resource, BgDB (Bitter Gourd Resource Database), on a single platform, vital for advanced bitter gourd breeding programmes that aim to raise bitter gourd varieties with traits of significant social and economic value. 
BgDB, available at https://bgdb.daasbioinfromaticsteam.in/index.php, is a user-friendly, three-tier database that offers a comprehensive interface with detailed analysed information, including 114 598 transcripts, 4914 differentially expressed genes, 32 570 predicted simple sequence repeat markers, and 162 850 primers for downstream applications. It also catalogues extensive annotations of bitter gourd-specific single nucleotide polymorphisms/insertions and deletions, long noncoding RNAs, circular RNAs, microRNAs, 1220 transcription factors, 295 transcription regulators, and 146 quantitative trait loci (QTL) distributed throughout the chromosomes. This genomic resource is poised to significantly advance genetic diversity analyses, population and varietal differentiation, and trait optimization. It further facilitates the exploration of regulatory ncRNA elements, key transcripts, and essential transcription factors and regulators. The discovery of QTL will aid in the development of improved bitter gourd varieties in the endeavour of enhanced productivity. Beyond comprehensive datasets, the future integration of multi-omics resources could profoundly advance and fully unlock the potential of databases. Database URL: https://bgdb.daasbioinfromaticsteam.in/index.php.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":" ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144583338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fang-Yi Su, Gia-Han Ngo, Ben Phan, Jung-Hsien Chiang
{"title":"CAS: enhancing implicit constrained data augmentation with semantic enrichment for biomedical relation extraction and beyond.","authors":"Fang-Yi Su, Gia-Han Ngo, Ben Phan, Jung-Hsien Chiang","doi":"10.1093/database/baaf025","DOIUrl":"10.1093/database/baaf025","url":null,"abstract":"<p><p>Biomedical relation extraction often involves datasets with implicit constraints, where structural, syntactic, or semantic rules must be strictly preserved to maintain data integrity. Traditional data augmentation techniques struggle in these scenarios, as they risk violating domain-specific constraints. To address these challenges, we propose CAS (Constrained Augmentation and Semantic-Quality), a novel framework designed for constrained datasets. CAS employs large language models to generate diverse data variations while adhering to predefined rules, and it integrates the SemQ Filter. This self-evaluation mechanism ensures the quality and consistency of augmented data by filtering out noisy or semantically incongruent samples. Although CAS is primarily designed for biomedical relation extraction, its versatile design extends its applicability to tasks with implicit constraints, such as code completion, mathematical reasoning, and information retrieval. Through extensive experiments across multiple domains, CAS demonstrates its ability to enhance model performance by maintaining structural fidelity and semantic accuracy in augmented data. These results highlight the potential of CAS not only in advancing biomedical NLP research but also in addressing data augmentation challenges in diverse constrained-task settings within natural language processing. 
Database URL: https://github.com/ngogiahan149/CAS.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12224179/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144552558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muhammad Nabeel Asim, Tayyaba Asif, Faiza Hassan, Andreas Dengel
{"title":"Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.","authors":"Muhammad Nabeel Asim, Tayyaba Asif, Faiza Hassan, Andreas Dengel","doi":"10.1093/database/baaf027","DOIUrl":"10.1093/database/baaf027","url":null,"abstract":"<p><p>Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to move from wet-lab experiments to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To meet the need for a comprehensive review that presents a holistic view of a wide array of tasks and applications, this manuscript makes several contributions: It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of 63 protein sequence analysis tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. 
It presents a rich data landscape, encompassing 627 benchmark datasets for 63 diverse protein sequence analysis tasks. It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by presenting current state-of-the-art performance across 63 protein sequence analysis tasks.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12125710/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144191613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing biomedical relation extraction through data-centric and preprocessing-robust ensemble learning approach.","authors":"Wilailack Meesawad, Jen-Chieh Han, Chun-Yu Hsueh, Yu Zhang, Hsi-Chuan Hung, Richard Tzong-Han Tsai","doi":"10.1093/database/baae127","DOIUrl":"10.1093/database/baae127","url":null,"abstract":"<p><p>The paper describes our biomedical relation extraction system, which is designed to participate in the BioCreative VIII challenge Track 1: BioRED Track, which emphasizes the relation extraction from biomedical literature. Our system employs an ensemble learning method, leveraging the PubTator API in conjunction with multiple pretrained bidirectional encoder representations from transformer (BERT) models. Various preprocessing inputs are incorporated, encompassing prompt questions, entity ID pairs, and co-occurrence contexts. To enhance model comprehension, special tokens and boundary tags are incorporated. Specifically, we utilize PubMedBERT alongside the Max Rule ensemble learning mechanism to amalgamate outputs from diverse classifiers. Our findings surpass the established benchmark score, thereby providing a robust benchmark for evaluating performance in this task. 
Moreover, our study introduces and demonstrates the effectiveness of a data-centric approach, emphasizing the significance of prioritizing high-quality data instances in enhancing model performance and robustness.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12097206/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144126742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz
{"title":"An exploratory study combining Virtual Reality and Semantic Web for life science research using Graph2VR.","authors":"Alexander J Kellmann, Sander van den Hoek, Max Postema, W T Kars Maassen, Brenda S Hijmans, Marije A van der Geest, K Joeri van der Velde, Esther J van Enckevort, Morris A Swertz","doi":"10.1093/database/baaf008","DOIUrl":"10.1093/database/baaf008","url":null,"abstract":"<p><p>We previously described Graph2VR, a prototype that enables researchers to use virtual reality (VR) to explore and navigate through Linked Data graphs using SPARQL queries (see https://doi.org/10.1093/database/baae008). Here we evaluate the use of Graph2VR in three realistic life science use cases. The first use case visualizes metadata from large-scale multi-center cohort studies across Europe and Canada via the EUCAN Connect catalogue. The second use case involves a set of genomic data from synthetic rare disease patients, which was processed through the Variant Interpretation Pipeline and then converted into Resource Description Format for visualization. The third use case involves enriching a graph with additional information, in this case, the Dutch Anatomical Therapeutic Chemical code Ontology with the DrugID from Drugbank. These examples collectively showcase Graph2VR's potential for data exploration and enrichment, as well as some of its limitations. We conclude that the endless three-dimensional space provided by VR indeed shows much potential for the navigation of very large knowledge graphs, and we provide recommendations for data preparation and VR tooling moving forward. 
Database URL: https://doi.org/10.1093/database/baaf008.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12090995/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144110024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database.","authors":"Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini","doi":"10.1093/database/baaf035","DOIUrl":"10.1093/database/baaf035","url":null,"abstract":"<p><p>Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences well before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified, and 116 million of them were validated as true positives. 
Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, the LOG gene, in which significant mutations can be found across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools, available at https://caps.ncbs.res.in/gendis3/, offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12063530/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143978712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}