Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework.

IF 0.9 | Zone 2, Philosophy | RELIGION
RELIGION Pub Date : 2023-04-21 DOI:10.2196/40755
Edgar Steiger, Lars Eric Kroll

Abstract

Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework.

Background: In health care, diagnosis codes in claims data and electronic health records (EHRs) play an important role in data-driven decision making. Any analysis that uses a patient's diagnosis codes to predict future outcomes or describe morbidity requires a numerical representation of this diagnosis profile made up of string-based diagnosis codes. These numerical representations are especially important for machine learning models. Most commonly, binary-encoded representations have been used, usually for a subset of diagnoses. In real-world health care applications, several issues arise: patient profiles show high variability even when the underlying diseases are the same, they may have gaps and not contain all available information, and a large number of appropriate diagnoses must be considered.
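The binary-encoded representation mentioned above can be sketched as follows. This is a minimal illustration, not the study's baseline implementation: the helper names `fit_top_codes` and `binary_encode`, the choice of `k`, and the toy ICD-10 profiles are all assumptions made for the example.

```python
from collections import Counter


def fit_top_codes(profiles, k):
    """Return the k most frequent diagnosis codes, counting each code once per patient."""
    counts = Counter(code for profile in profiles for code in set(profile))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda code: (-counts[code], code))
    return ranked[:k]


def binary_encode(profile, vocabulary):
    """Map one patient's codes to a 0/1 vector with one position per vocabulary code."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]


# Toy ICD-10 profiles, invented for illustration (not taken from the study data).
profiles = [
    ["I10", "E11", "E78"],
    ["I10", "J45"],
    ["E11", "I10", "M54"],
    ["M54", "J45", "F32"],
]

vocabulary = fit_top_codes(profiles, k=3)
vectors = [binary_encode(p, vocabulary) for p in profiles]

for profile, vector in zip(profiles, vectors):
    print(profile, "->", vector)
```

Because only the `k` most common codes get a position, any diagnosis outside that subset is silently dropped from the vector, which is one of the limitations the background describes.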

Objective: We herein present Pat2Vec, a self-supervised machine learning framework inspired by neural network-based natural language processing that embeds complete diagnosis profiles into a small real-valued numerical vector.
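To make the idea of a small real-valued patient vector concrete, here is a rough sketch under strong assumptions. It does not reproduce Pat2Vec: the abstract does not specify the model internals, so this example swaps in plain mean pooling over per-code vectors. The 3-dimensional values in `CODE_VECTORS` and the helper names `embed_profile` and `cosine` are made up for illustration; in a trained model the vectors would be learned from the claims data.

```python
import math

# Hypothetical 3-dimensional code vectors, invented for this example.
CODE_VECTORS = {
    "I10": [0.9, 0.1, 0.0],
    "E11": [0.8, 0.3, 0.1],
    "J45": [0.0, 0.9, 0.2],
    "M54": [0.1, 0.2, 0.9],
}


def embed_profile(profile, code_vectors):
    """Average the vectors of all known codes; codes without a vector are skipped."""
    dim = len(next(iter(code_vectors.values())))
    known = [code_vectors[code] for code in profile if code in code_vectors]
    if not known:
        # No usable code: fall back to the zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


full = embed_profile(["I10", "E11", "M54"], CODE_VECTORS)
partial = embed_profile(["I10", "M54"], CODE_VECTORS)  # same toy profile, one code missing
other = embed_profile(["J45"], CODE_VECTORS)

print("full vs partial:", cosine(full, partial))
print("full vs other:  ", cosine(full, other))
```

Every profile, however many codes it contains, maps to a vector of the same fixed length, so downstream models can consume complete diagnosis profiles rather than a preselected subset of codes.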

Methods: Based on German outpatient claims data with diagnosis codes according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), we discovered an optimal vectorization embedding model for patient diagnosis profiles with Bayesian optimization for the hyperparameters. The calibration process ensured a robust embedding model for health care-relevant tasks by aggregating the metrics of different regression and classification tasks using different machine learning algorithms (linear and logistic regression as well as gradient-boosted trees). The models were tested against a baseline model that binary-encodes the most common diagnoses. The study used diagnosis profiles and supplementary data from more than 10 million patients from 2016 to 2019 and was based on the largest German ambulatory claims data set. To describe subpopulations in health care, we identified clusters (via density-based clustering) and visualized patient vectors in 2D (via dimensionality reduction with uniform manifold approximation and projection [UMAP]). Furthermore, we applied our vectorization model to predict prospective drug prescription costs based on patients' diagnoses.

Results: Our final models outperform the baseline model (binary encoding) with equal dimensions. They are more robust to missing data and show large performance gains, particularly in lower dimensions, demonstrating the embedding model's compression of nonlinear information. In the future, other sources of health care data can be integrated into the current diagnosis-based framework. Other researchers can apply our publicly shared embedding model to their own diagnosis data.

Conclusions: We envision a wide range of applications for Pat2Vec that will improve health care quality, including personalized prevention and signal detection in patient surveillance as well as health care resource planning based on subcohorts identified by our data-driven machine learning framework.
