BERTopic-driven term extraction from biomedical texts toward ontology population: evaluating vaccine ontology with Plotkin's vaccines corpus.

IF 2 | Zone 3, Engineering & Technology | JCR Q3, MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Biomedical Semantics Pub Date : 2026-05-03 DOI:10.1186/s13326-026-00353-w

B Damayanthi Jesudas, Sam Smith, Feng-Yu Yeh, Jie Zheng, John Beverley, William D Duncan, Yongqun He



Abstract

Background: Ontologies are essential for structuring biomedical knowledge and for supporting semantic integration, reasoning, and data interoperability. In vaccinology, ontology population is particularly critical because vaccines span diverse domains. A well-defined Vaccine Ontology (VO) enables consistent knowledge representation and integration across datasets, and supports applications such as decision support, literature mining, and semantic search. However, manual ontology population is tedious, time-consuming, and difficult to maintain in this rapidly evolving domain, underscoring the need for automated or semi-automated population approaches.

Methods: We present a semi-automated pipeline that uses Bidirectional Encoder Representations from Transformers and Topic Modeling (BERTopic) to extract ontology-relevant concepts from biomedical text. To evaluate the effectiveness of this approach, the method is applied to the Plotkin's Vaccines corpus, a leading reference text in vaccinology that synthesizes scientific, clinical, and policy perspectives on vaccines. The workflow integrates multiple natural language processing (NLP) components: document preprocessing with spaCy part-of-speech tagging and vectorization, sentence embeddings generated by a lightweight transformer model (all-MiniLM-L6-v2), dimensionality reduction with Uniform Manifold Approximation and Projection (UMAP), clustering with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and topic representation via class-based Term Frequency-Inverse Document Frequency (c-TF-IDF). To guide topic discovery toward vaccine-relevant concepts and filter irrelevant terms, the pipeline uses a curated set of vaccine-focused terms, derived from an existing vaccine ontology, as seed words that influence topic representations while preserving the unsupervised nature of the clustering process. To enhance interpretability, the pipeline employs KeyBERT (keyword extraction using BERT embeddings) for automatic keyword-based labeling, supplemented with disambiguated descriptive labels, and Bidirectional and Auto-Regressive Transformers (BART) summarization for topic-level summaries. The resulting hierarchical topic structures are further refined through a tree-merging module that unifies multiple topic hierarchies into a coherent ontology-like representation.
The extracted topics are reviewed by subject matter experts (SMEs) to filter irrelevant terms and are then mapped to the Vaccine Ontology, a well-established ontology, to assess their relevance and coverage, demonstrating how automated methods can reduce the labor-intensive effort required for manual ontology population.
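
The topic-representation step named above can be made concrete with a small sketch. The following is not the authors' code: it is a minimal, standard-library illustration of c-TF-IDF as commonly defined for BERTopic (term frequency within a topic, L1-normalized, multiplied by log(1 + A / f_t), where A is the average token count per topic and f_t is the term's frequency across all topics). The seed-word boost (`seed_multiplier`) is an assumed way of letting ontology-derived terms influence the ranking, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic, seed_words=(), seed_multiplier=2.0):
    """Score terms per topic with class-based TF-IDF.

    docs_by_topic maps a topic id to a list of tokenized documents.
    Terms listed in seed_words get their weight multiplied by
    seed_multiplier (an illustrative assumption, not the paper's setting).
    """
    # Treat all documents of a topic as one concatenated "class document".
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each term across all topics.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of tokens per topic (the "A" in the formula).
    avg_tokens = sum(total_counts.values()) / len(class_counts)

    seeds = set(seed_words)
    scores = {}
    for topic, counts in class_counts.items():
        n_tokens = sum(counts.values())
        scores[topic] = {}
        for term, freq in counts.items():
            tf = freq / n_tokens  # L1-normalized term frequency
            idf = math.log(1 + avg_tokens / total_counts[term])
            weight = tf * idf
            if term in seeds:
                weight *= seed_multiplier  # nudge seed terms upward
            scores[topic][term] = weight
    return scores


def top_keywords(scores, topic, k=10):
    """Return the k highest-scoring terms of a topic (ties broken by name)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: two topics, two tokenized documents each.
docs = {
    0: [["measles", "vaccine", "antibody"], ["measles", "outbreak", "vaccine"]],
    1: [["adjuvant", "antibody", "response"], ["adjuvant", "vaccine", "dose"]],
}
print(top_keywords(c_tf_idf(docs), 0, k=2))
print(top_keywords(c_tf_idf(docs, seed_words=["outbreak"]), 0, k=2))
```

In this sketch the seed words change only how terms are ranked within a topic, not which documents belong to it, which matches the abstract's description of seeding that leaves the clustering unsupervised.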

Results: The script can be customized to generate a varying number of topics and keywords. In this study, the top 50 topics, with 10 keywords per topic, were extracted for each chapter of Plotkin's Vaccines. The pipeline produced coherent topic clusters representing key themes in vaccinology, including immune mechanisms, pathogen-specific vaccines, and vaccine types. The hierarchical tree-merging process illustrates how semantically related concept groupings emerge and can suggest potential ontology subdivisions. The merged tree serves as a visualization of conceptual relationships derived from the data and is particularly helpful for SMEs who review, interpret, and validate candidate concepts.
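
The abstract does not specify how the tree-merging module combines hierarchies, so the following is only a sketch under one assumption: nodes that carry the same label at the same position are unified and their children are merged recursively. The function names and the chapter trees are hypothetical.

```python
def merge_trees(*trees):
    """Merge nested dicts (label -> subtree); equal labels are unified."""
    merged = {}
    for tree in trees:
        for label, children in tree.items():
            # Combine the subtree seen so far with the new one.
            merged[label] = merge_trees(merged.get(label, {}), children)
    return merged


def outline(tree, depth=0):
    """Render a tree as indented lines, sorted by label, for SME review."""
    lines = []
    for label in sorted(tree):
        lines.append("  " * depth + label)
        lines.extend(outline(tree[label], depth + 1))
    return lines


# Hypothetical per-chapter topic hierarchies (leaves are empty dicts).
chapter_a = {"vaccine": {"live attenuated vaccine": {"measles vaccine": {}}}}
chapter_b = {
    "vaccine": {
        "live attenuated vaccine": {"rubella vaccine": {}},
        "subunit vaccine": {},
    }
}

merged = merge_trees(chapter_a, chapter_b)
print("\n".join(outline(merged)))
```

Merging on exact labels is the simplest possible rule; a real pipeline would likely need label normalization or embedding similarity to decide when two topics from different chapters denote the same concept.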

Conclusions: This study demonstrates the feasibility of a BERTopic-driven, semi-automated approach for extracting ontology-relevant concepts from biomedical texts. The method was evaluated using a foundational vaccinology corpus and assessed against an existing, well-developed vaccine ontology to determine the relevance and coverage of the extracted topics. Mapping the topics to the established ontology enabled identification of concept alignments and irrelevant terms, which were subsequently reviewed by SMEs. The results show that the proposed approach can effectively surface meaningful, ontology-relevant concepts while significantly reducing the time and effort required for manual population, thereby providing a scalable strategy for supporting ontology maintenance and enrichment.
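
The mapping step can be illustrated with a simple coverage calculation. This sketch assumes exact matching on normalized labels, which is a deliberate simplification: the abstract does not describe the matching procedure. The identifiers (`EX:0001`, `EX:0002`) are placeholders, not real Vaccine Ontology IDs, and the candidate terms are hypothetical.

```python
def normalize(term):
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def map_terms(candidates, ontology_labels):
    """Match candidate terms to ontology labels by normalized exact match.

    ontology_labels maps a term id to its label. Returns the matched
    terms with their ids, the unmatched terms, and the matched fraction.
    """
    index = {normalize(label): term_id for term_id, label in ontology_labels.items()}
    matched, unmatched = {}, []
    for term in candidates:
        key = normalize(term)
        if key in index:
            matched[term] = index[key]
        else:
            unmatched.append(term)  # left for SME review
    coverage = len(matched) / len(candidates) if candidates else 0.0
    return matched, unmatched, coverage


# Placeholder ontology entries and hypothetical extracted keywords.
ontology = {"EX:0001": "live attenuated vaccine", "EX:0002": "vaccine adjuvant"}
keywords = ["Live-attenuated vaccine", "vaccine adjuvant", "cold chain"]

matched, unmatched, coverage = map_terms(keywords, ontology)
print(matched)
print(unmatched)
print(f"coverage: {coverage:.2f}")
```

Unmatched terms are not necessarily irrelevant: they may be candidates for new ontology classes, which is why the abstract routes them to SMEs rather than discarding them.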


Source journal

Journal of Biomedical Semantics (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 4.20

Self-citation rate: 5.30%

Review time: 30 weeks

Journal description: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas. Infrastructure for biomedical semantics: semantic resources and repositories, metadata management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: approaches to and applications of semantic resources, and tools for investigation, reasoning, prediction, and discovery in biomedicine.