B Damayanthi Jesudas, Sam Smith, Feng-Yu Yeh, Jie Zheng, John Beverley, William D Duncan, Yongqun He
BERTopic-driven term extraction from biomedical texts toward ontology population: evaluating vaccine ontology with Plotkin's vaccines corpus.
Journal of Biomedical Semantics (Impact Factor 2.0; JCR Q3, Mathematical & Computational Biology). Published 2026-05-03. DOI: https://doi.org/10.1186/s13326-026-00353-w
Abstract
Background: Ontologies are essential for structuring biomedical knowledge, supporting semantic integration, reasoning, and data interoperability. In vaccinology, ontology population is particularly critical, as vaccines span diverse domains. A well-defined Vaccine Ontology (VO) enables consistent knowledge representation and integration across datasets, and supports applications such as decision support, literature mining, and semantic search. However, manual ontology population is tedious, time-consuming, and difficult to maintain in this dynamically evolving domain, underscoring the need for automated or semi-automated population approaches.
Methods: We present a semi-automated pipeline that uses Bidirectional Encoder Representations from Transformers and Topic Modeling (BERTopic) to extract ontology-relevant concepts from biomedical text. To evaluate the effectiveness of this automated approach, the method is applied to Plotkin's Vaccines corpus, a leading reference text in vaccinology that synthesizes scientific, clinical, and policy perspectives on vaccines. The workflow integrates multiple natural language processing (NLP) components: document preprocessing with spaCy part-of-speech tagging and vectorization, sentence embeddings generated by a lightweight transformer model (all-MiniLM-L6-v2), dimensionality reduction with Uniform Manifold Approximation and Projection (UMAP), clustering with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and topic representation via Class-based Term Frequency - Inverse Document Frequency (c-TF-IDF). To guide topic discovery toward vaccine-relevant concepts and filter irrelevant terms, the pipeline incorporates a curated set of vaccine-focused terms derived from an existing vaccine ontology as seed words to influence topic representations, while preserving the unsupervised nature of the clustering process. To enhance interpretability, the pipeline employs Keyword extraction using BERT embeddings (KeyBERT) for automatic keyword-based labeling, supplemented with disambiguated descriptive labels, and Bidirectional and Auto-Regressive Transformer (BART) summarization for topic-level summaries. The resulting hierarchical topic structures are further refined through a tree-merging module that unifies multiple topic hierarchies into a coherent ontology-like representation. 
The extracted topics are reviewed by subject matter experts (SMEs) to filter irrelevant terms and then mapped to the Vaccine Ontology, a well-established ontology, to assess their relevance and coverage, demonstrating how automated methods can reduce the labor-intensive effort required for manual ontology population.
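The topic-representation step can be illustrated with a minimal, standard-library sketch of c-TF-IDF. It assumes the commonly documented formulation, in which a term's frequency within a class is L1-normalized and multiplied by log(1 + A / f_t), where A is the average number of tokens per class and f_t is the term's frequency across all classes. The function names and the toy token lists are hypothetical and are not taken from the paper's pipeline.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per class with c-TF-IDF.

    class_docs maps a class (topic) id to the token list obtained by
    concatenating every document assigned to that class.
    """
    # Term counts within each class.
    tf = {c: Counter(tokens) for c, tokens in class_docs.items()}
    # Frequency of each term summed over all classes (f_t).
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per class (A).
    avg_words = sum(len(t) for t in class_docs.values()) / len(class_docs)
    scores = {}
    for c, counts in tf.items():
        n = sum(counts.values())
        scores[c] = {
            term: (freq / n) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_terms(scores, k=10):
    """Return the k highest-scoring terms per class (ties broken alphabetically)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }


# Toy data, invented for illustration only.
toy = {
    "immune": ["antibody", "response", "antibody", "vaccine"],
    "pathogen": ["influenza", "virus", "vaccine", "virus"],
}
toy_scores = c_tf_idf(toy)
print(top_terms(toy_scores, k=2))
```

In this toy case "vaccine" appears in both classes, so it is down-weighted relative to class-specific terms, which is the behavior that makes c-TF-IDF useful for labeling clusters.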
Results: The script can be configured to generate a varying number of topics and keywords. In this study, the top 50 topics with 10 keywords per topic were extracted for each chapter of Plotkin's Vaccines. The pipeline produced coherent topic clusters representing key themes in vaccinology, including immune mechanisms, pathogen-specific vaccines, and vaccine types. The hierarchical tree-merging process is used to illustrate how semantically related concept groupings emerge and can suggest potential ontology subdivisions. This serves as a visualization of conceptual relationships derived from the data and is particularly helpful for SMEs to review, interpret, and validate candidate concepts.
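The abstract does not specify how the tree-merging module works, so the following is only a sketch of one plausible reading: each per-chapter hierarchy is a nested dictionary keyed by topic label, and nodes that share the same label at the same path are unified. The representation, the exact-label matching rule, and the example labels are all assumptions made for illustration.

```python
def merge_trees(*trees):
    """Merge nested-dict hierarchies into one tree.

    Each tree maps a label to a dict of its children. Nodes with the same
    label at the same path are unified and their children are merged
    recursively. Inputs are not modified.
    """
    merged = {}
    for tree in trees:
        for label, children in tree.items():
            merged[label] = merge_trees(merged.get(label, {}), children)
    return merged


def render(tree, depth=0):
    """Return the tree as indented lines, sorted by label, for SME review."""
    lines = []
    for label in sorted(tree):
        lines.append("  " * depth + label)
        lines.extend(render(tree[label], depth + 1))
    return lines


# Hypothetical per-chapter hierarchies.
chapter_a = {"vaccine": {"live attenuated vaccine": {}, "subunit vaccine": {}}}
chapter_b = {
    "vaccine": {"subunit vaccine": {"conjugate vaccine": {}}},
    "immune response": {"antibody response": {}},
}
print("\n".join(render(merge_trees(chapter_a, chapter_b))))
```

Exact label matching is the simplest possible unification rule; a real module might instead merge nodes by embedding similarity or by expert-curated synonyms.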
Conclusions: This study demonstrates the feasibility of a BERTopic-driven, semi-automated approach for extracting ontology-relevant concepts from biomedical texts. The method was evaluated using a foundational vaccinology corpus and assessed against an existing, well-developed vaccine ontology to determine the relevance and coverage of the extracted topics. Mapping the topics to the established ontology enabled identification of concept alignments and irrelevant terms, which were subsequently reviewed by SMEs. The results show that the proposed approach can effectively surface meaningful, ontology-relevant concepts while significantly reducing the time and effort for manual population, thereby providing a scalable strategy for supporting ontology maintenance and enrichment.
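The mapping step described above can be sketched as a simple label lookup. The abstract does not state the matching method, so this example assumes case-insensitive, whitespace-normalized exact matching against ontology labels. The identifiers (`EX:0001` and so on) are placeholders, not real Vaccine Ontology IDs, and the candidate terms are invented.

```python
def normalize(term):
    """Lowercase a term and collapse internal whitespace."""
    return " ".join(term.lower().split())


def map_terms(candidates, ontology_labels):
    """Split candidate terms into those matching an ontology label and the rest.

    ontology_labels maps a label to its term identifier. Returns a dict of
    matched candidate -> identifier and a list of unmatched candidates,
    which would go to SMEs for review.
    """
    index = {normalize(label): term_id for label, term_id in ontology_labels.items()}
    matched, unmatched = {}, []
    for term in candidates:
        key = normalize(term)
        if key in index:
            matched[term] = index[key]
        else:
            unmatched.append(term)
    return matched, unmatched


# Placeholder labels and identifiers, for illustration only.
labels = {"Subunit vaccine": "EX:0001", "Vaccine adjuvant": "EX:0002"}
candidates = ["subunit  vaccine", "vaccine adjuvant", "chapter summary"]
matched, unmatched = map_terms(candidates, labels)
coverage = len(matched) / len(candidates)
print(matched, unmatched, round(coverage, 2))
```

Unmatched terms are not necessarily irrelevant: they may be noise, or they may be candidate concepts missing from the ontology, which is why expert review follows the automatic mapping.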
About the journal:
Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:
Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.
Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.