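Predicting Drug-Side Effect Relationships From Parametric Knowledge Embedded in Biomedical BERT Models: Methodological Study With a Natural Language Processing Approach.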

IF 3.8, Zone 3 (Medicine), JCR Q2 (Medical Informatics)

JMIR Medical Informatics Pub Date : 2025-07-10 DOI:10.2196/67513

Woohyuk Jeon, Minjae Park, Doyeon An, Wonshik Nam, Ju-Young Shin, Seunghee Lee, Suehyun Lee

JMIR Medical Informatics, volume 13, e67513. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12287980/pdf/

Citations: 0

Abstract


Predicting Drug-Side Effect Relationships From Parametric Knowledge Embedded in Biomedical BERT Models: Methodological Study With a Natural Language Processing Approach.


Background: Adverse drug reactions (ADRs) pose serious risks to patient health, and effectively predicting and managing them is an important public health challenge. Given the complexity and specificity of biomedical text data, the traditional context-independent word embedding model, Word2Vec, has limitations in fully reflecting the domain specificity of such data. Although Bidirectional Encoder Representations from Transformers (BERT)-based models pretrained on biomedical corpora have demonstrated high performance in ADR-related studies, research using these models to predict previously unknown drug-side effect relationships remains insufficient.

Objective: This study proposes a method for predicting drug-side effect relationships by leveraging the parametric knowledge embedded in biomedical BERT models. Through this approach, we predict promising candidates for potential drug-side effect relationships with unknown causal mechanisms by leveraging parametric knowledge from biomedical BERT models and embedding vector similarities of known relationships.

Methods: We used 158,096 pairs of drug-side effect relationships from the Side Effect Resource (SIDER) database to generate an adjacency matrix and calculate the cosine similarity between word embedding vectors of drugs and side effects. Relation scores were calculated for 8,235,435 drug-side effect pairs using this similarity. To evaluate the prediction accuracy of drug-side effect relationships, the area under the curve (AUC) value was measured using the calculated relation score and 158,096 known drug-side effect relationships from SIDER.
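The Methods steps (adjacency matrix, cosine similarity, relation score) can be illustrated with a minimal sketch. The abstract does not give the relation score formula, so the `relation_scores` function below is an assumption: it scores a drug-side effect pair as the similarity-weighted mean of the known labels of the other drugs. The drug names, side effects, and two-dimensional vectors are made-up toy values, not SIDER data or real BERT embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_adjacency(drugs, effects, known_pairs):
    """Binary drug x side-effect matrix: 1 where the pair is a known relationship."""
    known = set(known_pairs)
    return [[1 if (d, e) in known else 0 for e in effects] for d in drugs]


def relation_scores(vectors, adjacency):
    """Hypothetical relation score (an assumption, not the paper's formula).

    For drug i and side effect j, average the known labels of all OTHER drugs,
    weighted by their non-negative cosine similarity to drug i.
    """
    n_drugs = len(vectors)
    n_effects = len(adjacency[0])
    scores = [[0.0] * n_effects for _ in range(n_drugs)]
    for i in range(n_drugs):
        weights = [
            max(cosine_similarity(vectors[i], vectors[k]), 0.0) if k != i else 0.0
            for k in range(n_drugs)
        ]
        total = sum(weights)
        if total == 0.0:
            continue  # no similar neighbours: leave the row at 0.0
        for j in range(n_effects):
            weighted = sum(w * adjacency[k][j] for k, w in enumerate(weights))
            scores[i][j] = weighted / total
    return scores


# Toy data, made up for illustration only.
drugs = ["drug_a", "drug_b", "drug_c"]
effects = ["nausea", "rash"]
vectors = [[1.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
known_pairs = [("drug_a", "nausea"), ("drug_b", "nausea"), ("drug_c", "rash")]

adjacency = build_adjacency(drugs, effects, known_pairs)
scores = relation_scores(vectors, adjacency)
for drug, row in zip(drugs, scores):
    print(drug, [round(s, 3) for s in row])
```

In this toy setup, a pair that is absent from the adjacency matrix but receives a high score (here `drug_c` with `nausea`) plays the role of a predicted candidate relationship.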

Results: The clagator/biobert_v1.1 model achieved an AUC of 0.915 at an optimal threshold of 0.289, outperforming the existing Word2Vec model with an AUC of 0.848. The BERT-based models pretrained on the biomedical corpus outperformed the vanilla BERT model with an AUC of 0.857. External validation with the FDA (Food and Drug Administration) Adverse Event Reporting System data, using Fisher exact test based on 8,235,435 predicted drug-side effect pairs and 901,361 known relationships, confirmed high statistical significance (P<.001) with an odds ratio of 4.822. In addition, a literature review of predicted drug-side effect relationships not confirmed in the SIDER database revealed that these relationships have been reported in recent studies published after 2016.
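The AUC and "optimal threshold" figures in the Results can be illustrated with a short sketch. The AUC here is the rank-based (Mann-Whitney) form: the probability that a randomly chosen known pair scores above a randomly chosen unknown pair. The abstract does not say how the optimal threshold was selected, so using Youden's J (true positive rate minus false positive rate) is an assumption. The labels and scores are toy values, not the study's data.

```python
def roc_auc(labels, scores):
    """Rank-based AUC; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    positive_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (positive_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def best_threshold(labels, scores):
    """Threshold maximising Youden's J (an assumed criterion); predicts 1 when score >= t."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy example: 1 = known relationship, 0 = not known.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", roc_auc(toy_labels, toy_scores))
print("threshold, J:", best_threshold(toy_labels, toy_scores))
```

The pairwise loops are quadratic in the worst case, which is fine for a toy list but would need a sorted sweep for millions of pairs.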

Conclusions: This study introduces a method for extracting drug-side effect relationships embedded in parameters of language models pretrained on biomedical corpora and using this information to predict previously unknown drug-side effect relationships. We found that BERT-based models pretrained with biomedical corpora consider contextual information and achieve better performance in drug-side effect relationship prediction. External validation using the FDA Adverse Event Reporting System dataset and the literature review of certain cases confirmed high statistical significance, demonstrating practical applicability. These results highlight the utility of natural language processing-based approaches for predicting and managing ADRs.
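The external validation reports a Fisher exact test and an odds ratio over a 2x2 table (predicted or not predicted, against reported or not reported in the external data). A minimal sketch of both quantities follows. The two-sided p value sums the hypergeometric probabilities of all tables no more likely than the observed one, which is one common convention and an assumption about what the study used. The counts are a small made-up table, not the study's 8,235,435 pairs.

```python
import math
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b * c == 0:
        return math.inf
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the table [[a, b], [c, d]].

    With row and column totals fixed, the top-left cell follows a
    hypergeometric distribution; sum the probabilities of every table
    that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(p_value, 1.0)


# Made-up counts: rows = predicted / not predicted, columns = reported / not reported.
a, b, c, d = 3, 1, 1, 3
print("odds ratio:", odds_ratio(a, b, c, d))
print("two-sided p:", round(fisher_exact_two_sided(a, b, c, d), 4))
```

An odds ratio above 1 means predicted pairs are more likely to appear in the external data than unpredicted pairs; the reported value of 4.822 is read in this sense.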


Source journal

JMIR Medical Informatics (Medicine: Health Informatics)

CiteScore: 7.90

Self-citation rate: 3.10%

Articles published: 173

Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers that are more technical or more formative than what would be published in the Journal of Medical Internet Research.