Ontology enrichment using a large language model: Applying lexical, semantic, and knowledge network-based similarity for concept placement

IF 4.5; Zone 2 (Medicine); Q2 Computer Science, Interdisciplinary Applications

Journal of Biomedical Informatics Pub Date : 2025-06-19 DOI:10.1016/j.jbi.2025.104865

Navya Martin Kollapally, James Geller, Vipina Kuttichi Keloth, Zhe He, Julia Xu

Journal of Biomedical Informatics, Volume 168, Article 104865. Publication type: Journal Article. Open access: no. https://www.sciencedirect.com/science/article/pii/S1532046425000942

Citations: 0

Abstract

Objective

Ontologies are essential for representing the knowledge of a domain. To be useful, an ontology must provide a comprehensive view of its domain. Ontology enrichment requires discovering new concepts to add, either because they were missed when the ontology was first built or because the field has advanced and new real-world concepts have emerged. Our goal is to develop an automatic enrichment pipeline that uses a seed ontology, a Large Language Model (LLM), and a source of text. The pipeline is applied to the domain of Social Determinants of Health (SDoH), using PubMed as the source of concepts. In this work, the applicability and effectiveness of the enrichment pipeline are demonstrated by extending the SDoH ontology called SOHOv1; however, our methodology could be used in other domains as well.

Methods

We first retrieved PubMed abstracts of candidate articles, using existing SOHOv1 concepts as search terms. Next, we used GPT-4-1201 to extract semantic triples from the abstracts. We identified concepts from these triples using lexical, semantic, and knowledge network-based filtering. We also compared the granularity of the semantic triples extracted with our method to that of the triples in SemMedDB (Semantic MEDLINE Database). The results were evaluated by human experts and checked with standard ontology tools for consistency and semantic correctness.
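As a rough illustration of the lexical part of the filtering step only, the sketch below collects the subjects and objects of extracted triples and drops those that are near-duplicates of seed concepts. Everything in it is an assumption made for illustration: the seed concepts, the triples, the 0.9 threshold, and the use of Python's `difflib` ratio as the similarity measure are not taken from the paper, and the semantic and knowledge network-based filters are not shown.

```python
from difflib import SequenceMatcher

# Hypothetical seed concepts and triples; not taken from SOHOv1 or the paper.
SEED_CONCEPTS = ["food insecurity", "housing instability", "social isolation"]

TRIPLES = [
    ("Food Insecurity", "is associated with", "poor glycemic control"),
    ("housing instabilty", "increases", "emergency department visits"),
    ("social isolation", "worsens", "loneliness in older adults"),
]


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_concepts(triples, seed, duplicate_threshold=0.9):
    """Return triple subjects/objects that are not near-duplicates of a seed concept.

    The threshold is an assumed value chosen for this example.
    """
    candidates = []
    for subj, _pred, obj in triples:
        for term in (subj, obj):
            norm = term.strip().lower()
            if norm in candidates:
                continue  # already collected
            best = max(lexical_similarity(norm, s) for s in seed)
            if best < duplicate_threshold:
                candidates.append(norm)
    return candidates


if __name__ == "__main__":
    for concept in candidate_concepts(TRIPLES, SEED_CONCEPTS):
        print(concept)
```

In this toy input, the exact match "Food Insecurity" and the misspelled "housing instabilty" are treated as already covered by the seed ontology, while the three remaining terms survive as candidates for later semantic and expert review.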

Results

We expanded SOHOv1, which contained 173 concepts and 585 axioms (including 207 logical axioms), to SOHOv2, which contains 572 concepts and 1,542 axioms (including 725 logical axioms). Our methods identified more concepts than those extracted from SemMedDB for the same task. While we have shown the feasibility of our approach for an SDoH ontology, the methodology is generalizable to other ontologies with an existing seed ontology and text corpus.
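The size of the expansion follows directly from the reported counts. The short calculation below only restates the figures given above as differences and growth factors; the dictionary keys are names chosen for this example.

```python
# Counts as reported in the Results paragraph.
sohov1 = {"concepts": 173, "axioms": 585, "logical_axioms": 207}
sohov2 = {"concepts": 572, "axioms": 1542, "logical_axioms": 725}

# Absolute increase and growth factor for each count.
added = {key: sohov2[key] - sohov1[key] for key in sohov1}
factor = {key: round(sohov2[key] / sohov1[key], 2) for key in sohov1}

if __name__ == "__main__":
    for key in sohov1:
        print(f"{key}: +{added[key]} (x{factor[key]})")
```

That is, SOHOv2 adds 399 concepts, 957 axioms, and 518 logical axioms to SOHOv1.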

Conclusions

The contributions of this work are: extracting semantic triples from PubMed abstracts using GPT-4-1201 with prompt chaining; showing the superiority of triples from GPT-4-1201 over triples from SemMedDB for SDoH; using lexical and semantic similarity search techniques together with knowledge network-based search to identify the concepts to be added to the ontology; and confirming the quality of the new concepts with human experts.
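Prompt chaining, in general terms, means feeding the output of one prompt into the next. The sketch below shows a two-step chain for triple extraction under stated assumptions: the prompt wording, the JSON output format, and the `fake_llm` stand-in are all hypothetical and are not the prompts or the model interface used in the paper. A real model call would be passed in place of `fake_llm`.

```python
import json
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def extract_triples(abstract: str, llm: Callable[[str], str]) -> List[Triple]:
    """Two-step prompt chain: abstract -> statements -> (subject, predicate, object)."""
    # Step 1: ask for the relevant statements in the abstract.
    statements = llm(
        "List the statements about social determinants of health "
        "in this abstract, one per line:\n" + abstract
    )
    # Step 2: the output of step 1 becomes the input of the second prompt.
    raw = llm(
        "Convert each statement into a JSON list of "
        "[subject, predicate, object] triples:\n" + statements
    )
    # Keep only well-formed three-element items.
    return [tuple(item) for item in json.loads(raw) if len(item) == 3]


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a model call so the sketch runs offline."""
    if prompt.startswith("List"):
        return "Food insecurity is associated with poor glycemic control."
    return '[["food insecurity", "associated with", "poor glycemic control"]]'


if __name__ == "__main__":
    print(extract_triples("(abstract text here)", fake_llm))
```

Passing the model as a callable keeps the chain testable without network access and makes it easy to swap in a different model.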

Source journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 8.90

Self-citation rate: 6.70%

Articles published: 243

Review time: 32 days

About the journal: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.