Unveiling differential adverse event profiles in vaccines via LLM text embeddings and ontology semantic analysis.

Impact factor: 2.0 | Zone 3 (Engineering & Technology) | JCR Q3 | Mathematical & Computational Biology

Journal of Biomedical Semantics Pub Date : 2025-05-23 DOI:10.1186/s13326-025-00331-8

Zhigang Wang, Xingxian Li, Jie Zheng, Yongqun He

Volume 16 (1), page 10. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12102970/pdf/

Citations: 0

Abstract

Background: Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and manual assignment of AEs to terms in a terminology or ontology, which is time-consuming and limited in scope. This study explores the potential of Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis.

Results: We used the Llama-3 LLM to extract AE information from FDA-approved package inserts for 111 licensed vaccines, including 15 influenza vaccines. Llama-3 achieved over 80% accuracy in extracting AE text from the package inserts. Text embeddings were then generated for each vaccine's AE description using the nomic-embed-text and mxbai-embed-large models. To further evaluate the performance of the text embeddings, the vaccines were clustered using two methods: (1) LLM text embedding-based clustering and (2) ontology-based semantic similarity analysis. The ontology-based method mapped AEs to the Human Phenotype Ontology (HPO) and the Ontology of Adverse Events (OAE), with semantic similarity computed using Lin's method. Compared with the semantic similarity analysis, the LLM approach captured more differential AE profiles. Furthermore, LLM-derived text embeddings were used to develop a Lasso logistic regression model that predicts whether a vaccine is "Live" or "Non-Live", where "Non-Live" refers to all vaccines that do not contain live organisms, including inactivated and mRNA vaccines. A comparative analysis showed that, despite similar clustering patterns, the nomic-embed-text model outperformed mxbai-embed-large, achieving 80.00% sensitivity, 83.06% specificity, and 81.89% accuracy in 10-fold cross-validation. Our analysis of the AE embeddings identified many AE patterns, which are demonstrated with examples.
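To make the two similarity notions compared in the Results concrete, the following is a minimal toy sketch, not the authors' pipeline. It contrasts a similarity between embedding vectors with Lin's similarity between ontology terms, sim(a, b) = 2 * IC(MICA) / (IC(a) + IC(b)), where MICA is the most informative common ancestor. Everything specific here is an assumption made for illustration: the tiny single-parent ontology, the annotation counts, and the use of cosine similarity for embeddings (the abstract does not state which metric was used for clustering, nor how term-level scores were aggregated per vaccine).

```python
import math

# Hypothetical toy ontology (child -> parent); real HPO/OAE terms form a larger graph.
PARENT = {
    "adverse event": None,
    "injection-site AE": "adverse event",
    "systemic AE": "adverse event",
    "injection-site pain": "injection-site AE",
    "injection-site erythema": "injection-site AE",
    "fever": "systemic AE",
    "headache": "systemic AE",
}

# Hypothetical annotation counts for the leaf terms (illustrative numbers only).
DIRECT_COUNTS = {
    "injection-site pain": 40,
    "injection-site erythema": 20,
    "fever": 25,
    "headache": 15,
}


def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def information_content():
    """IC(t) = log(N / n_t), where n_t counts annotations of t and its descendants."""
    totals = {t: 0 for t in PARENT}
    for term, n in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            totals[anc] += n
    root_total = totals["adverse event"]
    return {t: math.log(root_total / c) for t, c in totals.items() if c > 0}


def lin_similarity(a, b, ic):
    """Lin's measure: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = set(ancestors(a)) & set(ancestors(b))
    mica_ic = max(ic[t] for t in common)
    denom = ic[a] + ic[b]
    return 2 * mica_ic / denom if denom > 0 else 1.0


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    ic = information_content()
    print("Lin(fever, headache):", round(lin_similarity("fever", "headache", ic), 3))
    print("Lin(fever, injection-site pain):",
          round(lin_similarity("fever", "injection-site pain", ic), 3))
    # Made-up 3-dimensional vectors standing in for real high-dimensional embeddings.
    vaccine_a = [0.9, 0.1, 0.3]
    vaccine_b = [0.8, 0.2, 0.4]
    print("cosine(vaccine_a, vaccine_b):",
          round(cosine_similarity(vaccine_a, vaccine_b), 3))
```

In this toy setup, two terms whose only shared ancestor is the root score zero under Lin's measure, because the root carries no information content. Embedding similarity has no such dependence on term mapping, which is one plausible reading of why the two approaches can cluster vaccines differently.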

Conclusion: This study demonstrates the effectiveness of LLMs for automated AE extraction and analysis. LLM text embeddings capture latent information about AEs, enabling more comprehensive knowledge discovery. Our findings suggest that LLMs have substantial potential for improving vaccine safety and public health research.


Source journal

Journal of Biomedical Semantics (Mathematical & Computational Biology)

CiteScore: 4.20

Self-citation rate: 5.30%

Review time: 30 weeks

About the journal: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas: (1) infrastructure for biomedical semantics, focusing on semantic resources and repositories, metadata management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability; and (2) semantic mining, annotation, and analysis, focusing on approaches and applications of semantic resources, and on tools for investigation, reasoning, prediction, and discovery in biomedicine.