Large language model aided automatic high-throughput drug screening using self-controlled cohort study

Shenbo Xu, Stan N. Finkelstein, Roy E. Welsch, Kenney Ng, Ioanna Tzoulaki, Lefkos Middleton
{"title":"Large language model aided automatic high-throughput drug screening using self-controlled cohort study","authors":"Shenbo Xu, Stan N. Finkelstein, Roy E. Welsch, Kenney Ng, Ioanna Tzoulaki, Lefkos Middleton","doi":"10.1101/2024.08.04.24311480","DOIUrl":null,"url":null,"abstract":"Background: Developing medicine from scratch to governmental authorization and detecting adverse drug reactions (ADR) have barely been economical, expeditious, and risk-averse investments. The availability of large-scale observational healthcare databases and the popularity of large language models offer an unparalleled opportunity to enable automatic high-throughput drug screening for both repurposing and pharmacovigilance. Objectives: To demonstrate a general workflow for automatic high-throughput drug screening with the following advantages: (i) the association of various exposure on diseases can be estimated; (ii) both repurposing and pharmacovigilance are integrated; (iii) accurate exposure length for each prescription is parsed from clinical texts; (iv) intrinsic relationship between drugs and diseases are removed jointly by bioinformatic mapping and large language model - ChatGPT; (v) causal-wise interpretations for incidence rate contrasts are provided. Methods: Using a self-controlled cohort study design where subjects serve as their own control group, we tested the intention-to-treat association between medications on the incidence of diseases. Exposure length for each prescription is determined by parsing common dosages in English free text into a structured format. Exposure period starts from initial prescription to treatment discontinuation. A same exposure length preceding initial treatment is the control period. Clinical outcomes and categories are identified using existing phenotyping algorithms. Incident rate ratios (IRR) are tested using uniformly most powerful (UMP) unbiased tests. Results: We assessed 3,444 medications on 276 diseases on 6,613,198 patients from the Clinical Practice Research Datalink (CPRD), an UK primary care electronic health records (EHR) spanning from 1987 to 2018. Due to the built-in selection bias of self-controlled cohort studies, ingredients-disease pairs confounded by deterministic medical relationships are removed by existing map from RxNorm and nonexistent maps by calling ChatGPT. A total of 16,901 drug-disease pairs reveals significant risk reduction, which can be considered as candidates for repurposing, while a total of 11,089 pairs showed significant risk increase, where drug safety might be of a concern instead. Conclusions: This work developed a data-driven, nonparametric, hypothesis generating, and automatic high-throughput workflow, which reveals the potential of natural language processing in pharmacoepidemiology. We demonstrate the paradigm to a large observational health dataset to help discover potential novel therapies and adverse drug effects. 
The framework of this study can be extended to other observational medical databases.","PeriodicalId":501071,"journal":{"name":"medRxiv - Epidemiology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Epidemiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.04.24311480","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: Developing a medicine from scratch through regulatory authorization, and detecting adverse drug reactions (ADRs), have rarely been economical, expeditious, or low-risk undertakings. The availability of large-scale observational healthcare databases and the popularity of large language models offer an unparalleled opportunity to enable automatic high-throughput drug screening for both repurposing and pharmacovigilance.

Objectives: To demonstrate a general workflow for automatic high-throughput drug screening with the following advantages: (i) associations between a wide range of exposures and diseases can be estimated; (ii) repurposing and pharmacovigilance are integrated; (iii) accurate exposure lengths for each prescription are parsed from clinical free text; (iv) drug-disease pairs linked by intrinsic medical relationships are removed jointly through bioinformatic mapping and a large language model (ChatGPT); (v) causal interpretations of incidence rate contrasts are provided.

Methods: Using a self-controlled cohort study design, in which subjects serve as their own controls, we tested the intention-to-treat association between medications and the incidence of diseases. Exposure length for each prescription was determined by parsing common dosage directions from English free text into a structured format. The exposure period runs from the initial prescription to treatment discontinuation; a control period of the same length immediately precedes the initial treatment. Clinical outcomes and disease categories were identified using existing phenotyping algorithms. Incidence rate ratios (IRRs) were tested using uniformly most powerful (UMP) unbiased tests.

Results: We assessed 3,444 medications and 276 diseases in 6,613,198 patients from the Clinical Practice Research Datalink (CPRD), a UK primary care electronic health record (EHR) database spanning 1987 to 2018. Because of the built-in selection bias of self-controlled cohort studies, ingredient-disease pairs confounded by deterministic medical relationships were removed, using existing mappings from RxNorm where available and queries to ChatGPT where no mapping exists. A total of 16,901 drug-disease pairs showed a significant risk reduction and can be considered candidates for repurposing, while 11,089 pairs showed a significant risk increase, where drug safety may instead be a concern.

Conclusions: This work developed a data-driven, nonparametric, hypothesis-generating, automatic high-throughput workflow that reveals the potential of natural language processing in pharmacoepidemiology. We applied the paradigm to a large observational health dataset to help discover potential novel therapies and adverse drug effects. The framework of this study can be extended to other observational medical databases.
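The Methods describe parsing common English dosage directions into a structured exposure length per prescription. The authors' actual parser is not shown in the abstract; the sketch below is a minimal illustration of the idea, and the regex pattern, frequency table, and function name are assumptions rather than the study's implementation.

```python
import re

# Illustrative only: a minimal regex-based parser for common English dosage
# strings (e.g. "take 1 tablet twice daily"). The real CPRD dosage parser is
# not described in the abstract, so these patterns are assumptions.
FREQ_PER_DAY = {
    "once daily": 1, "once a day": 1, "od": 1,
    "twice daily": 2, "twice a day": 2, "bd": 2,
    "three times daily": 3, "three times a day": 3, "tds": 3,
    "four times daily": 4, "qds": 4,
}

def exposure_days(dose_text: str, quantity_prescribed: float) -> float | None:
    """Estimate exposure length in days for one prescription.

    dose_text: free-text directions, e.g. "take 1 tablet twice daily"
    quantity_prescribed: number of dose units dispensed
    """
    text = dose_text.lower()
    m = re.search(r"(\d+(?:\.\d+)?)\s*(tablet|capsule|puff|ml)", text)
    units_per_dose = float(m.group(1)) if m else 1.0
    doses_per_day = next((n for phrase, n in FREQ_PER_DAY.items() if phrase in text), None)
    if doses_per_day is None:
        return None  # unparsed dosage; a real pipeline would handle this separately
    daily_units = units_per_dose * doses_per_day
    return quantity_prescribed / daily_units if daily_units > 0 else None

# Example: 56 tablets at "take 1 tablet twice daily" -> 28 days of exposure
print(exposure_days("take 1 tablet twice daily", 56))
```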
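The Methods also state that incidence rate ratios are tested with uniformly most powerful (UMP) unbiased tests. For two Poisson counts observed over known person-time, the classical construction conditions on the total count, under which the exposed count is binomial; with exposure and control windows of equal length the null success probability is 0.5. The sketch below implements that conditional exact test as an illustration; the function name and example numbers are illustrative and not taken from the study.

```python
from scipy.stats import binomtest

def irr_conditional_test(events_exposed: int, events_control: int,
                         time_exposed: float, time_control: float):
    """Conditional exact test for the incidence rate ratio (IRR) of two
    Poisson counts. Conditioning on the total count, the exposed count is
    Binomial(n, p0) under H0: IRR = 1, with p0 = time_exposed / total time.
    This conditional construction underlies the UMP unbiased test family
    named in the abstract; it is a sketch, not the authors' exact code.
    """
    n = events_exposed + events_control
    p0 = time_exposed / (time_exposed + time_control)
    irr = ((events_exposed / time_exposed) / (events_control / time_control)
           if events_control > 0 else float("inf"))
    result = binomtest(events_exposed, n, p0, alternative="two-sided")
    return irr, result.pvalue

# Example: 30 incident cases during exposure vs 12 during an equally long
# pre-exposure control window (illustrative numbers, not study data).
irr, p = irr_conditional_test(30, 12, 1000.0, 1000.0)
print(f"IRR = {irr:.2f}, p = {p:.4f}")
```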
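The Results mention removing ingredient-disease pairs linked by deterministic medical relationships, using RxNorm mappings where they exist and ChatGPT queries otherwise. Below is a minimal sketch of such a query using the OpenAI Python SDK; the prompt wording, model choice, and yes/no parsing are assumptions, not the authors' actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def has_deterministic_link(drug: str, disease: str, model: str = "gpt-4") -> bool:
    """Ask the model whether a drug-disease pair is linked by a deterministic
    medical relationship (e.g. indication or known contraindication), so the
    pair can be excluded before screening. Prompt and model are illustrative.
    """
    prompt = (
        f"Is '{disease}' an indication, contraindication, or otherwise "
        f"deterministically related to the drug '{drug}'? "
        "Answer with a single word: yes or no."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Example: pairs flagged True would be excluded from the screening results.
print(has_deterministic_link("metformin", "type 2 diabetes"))
```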