Unified Clinical Vocabulary Embeddings for Advancing Precision Medicine.

medRxiv: the preprint server for health sciences. Pub Date: 2024-12-10. DOI: 10.1101/2024.12.03.24318322

Ruth Johnson, Uri Gottlieb, Galit Shaham, Lihi Eisen, Jacob Waxman, Stav Devons-Sberro, Curtis R Ginder, Peter Hong, Raheel Sayeed, Ben Y Reis, Ran D Balicer, Noa Dagan, Marinka Zitnik



Abstract

Integrating clinical knowledge into AI remains challenging despite numerous medical guidelines and vocabularies. Medical codes, central to healthcare systems, often reflect operational patterns shaped by geographic factors, national policies, insurance frameworks, and physician practices rather than the precise representation of clinical knowledge. This disconnect hampers AI in representing clinical relationships, raising concerns about bias, transparency, and generalizability. Here, we developed a resource of 67,124 clinical vocabulary embeddings derived from a clinical knowledge graph tailored to electronic health record vocabularies, spanning over 1.3 million edges. Using graph transformer neural networks, we generated clinical vocabulary embeddings that provide a new representation of clinical knowledge by unifying seven medical vocabularies. These embeddings were validated through a phenotype risk score analysis involving 4.57 million patients from Clalit Healthcare Services, effectively stratifying individuals based on survival outcomes. Inter-institutional panels of clinicians evaluated the embeddings for alignment with clinical knowledge across 90 diseases and 3,000 clinical codes, confirming their robustness and transferability. This resource addresses gaps in integrating clinical vocabularies into AI models and training datasets, paving the way for knowledge-grounded population and patient-level models.


