Leveraging BERT for embedding ICD codes from large scale cardiovascular EMR data to understand patient diagnostic patterns.

Impact factor 3.8 | Zone 3 (Medicine) | JCR Q2 (Medical Informatics)
Minkyoung Kim, Yunha Kim, Hee Jun Kang, Hyeram Seo, Heejung Choi, JiYe Han, Gaeun Kee, Soyoung Ko, HyoJe Jung, Byeolhee Kim, Boeun Choi, Tae Joon Jun, Young-Hak Kim
Journal: BMC Medical Informatics and Decision Making, 25(1):300
Published: 2025-08-11
DOI: https://doi.org/10.1186/s12911-025-03145-x
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12337513/pdf/
Citations: 0

Abstract

The integration of electronic medical records (EMRs) with artificial intelligence (AI) is enhancing medical research, particularly in real-world evidence (RWE) studies. Extracting insights from coded medical data, such as ICD-10 codes, is essential for patient characterization. Traditional techniques, such as one-hot encoding (OHE), face limitations, particularly in managing high-dimensional data. In this study, a Bidirectional Encoder Representations from Transformers (BERT) approach is introduced to encode ICD-10 diagnostic codes, significantly improving model performance and reducing dimensionality. Data from 495,269 patients who visited the Cardiology Department at Asan Medical Center between 2000 and 2020 were used. The performance of models trained with OHE and ClinicalBERT embeddings was compared. For predicting major adverse cardiovascular events within one year following percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG), the ClinicalBERT (code-embedded) model outperformed OHE. It achieved an AUC of 0.746 compared to 0.719, while also significantly reducing the dimensionality from 2,492 to 128. This method, which integrates diagnostic and medication data, provides valuable insights into patient care, enhancing the precision of predictions and supporting healthcare professionals in making more informed decisions.
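The contrast the abstract draws between one-hot encoding and a dense code embedding can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the eight-code vocabulary, the 4-dimensional vectors (standing in for the 128 dimensions reported above), the random embedding table (standing in for vectors that a trained ClinicalBERT model would supply), and the mean pooling over a patient's codes. The abstract does not describe how the authors aggregate code vectors, so this is not their pipeline.

```python
import random

# Hypothetical toy vocabulary of ICD-10 codes; the study reports 2,492 OHE dimensions.
VOCAB = ["I10", "I20.0", "I21.9", "I25.1", "I48", "I50.9", "E11.9", "E78.5"]
CODE_INDEX = {code: i for i, code in enumerate(VOCAB)}
EMBED_DIM = 4  # illustrative stand-in for the 128-dimensional embedding


def multi_hot(codes):
    """One-hot style encoding: one slot per vocabulary code, 1.0 if present."""
    vec = [0.0] * len(VOCAB)
    for code in codes:
        vec[CODE_INDEX[code]] = 1.0
    return vec


def make_embedding_table(seed=0):
    """Random placeholder vectors; a real pipeline would take these from a trained model."""
    rng = random.Random(seed)
    return {code: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for code in VOCAB}


def mean_pooled_embedding(codes, table):
    """Average the code vectors so every patient gets a fixed-length dense vector."""
    if not codes:
        return [0.0] * EMBED_DIM
    sums = [0.0] * EMBED_DIM
    for code in codes:
        for j, value in enumerate(table[code]):
            sums[j] += value
    return [s / len(codes) for s in sums]


patient_codes = ["I10", "I25.1", "E11.9"]  # hypothetical patient history
table = make_embedding_table()
sparse = multi_hot(patient_codes)
dense = mean_pooled_embedding(patient_codes, table)
print(len(sparse), len(dense))  # prints: 8 4
```

The sparse vector grows with the vocabulary, while the dense vector keeps a fixed length no matter how many distinct codes exist, which is the dimensionality reduction the abstract refers to.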

Source journal
CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month
Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.