Applying text-mining to clinical notes: the identification of patient characteristics from electronic health records (EHRs).

IF 3.8 · Zone 3 (Medicine) · JCR Q2 (Medical Informatics)

BMC Medical Informatics and Decision Making · Pub Date: 2025-08-12 · DOI: 10.1186/s12911-025-03137-x

Simone Ten Hoope, Koen Welvaars, Kylian van Geijtenbeek, Mellanie Klok-Everaars, Sander van Schaik, Fatma Karapinar-Çarkit

BMC Medical Informatics and Decision Making 25(1):302. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12344823/pdf/

Citations: 0

Abstract

Background: Clinical notes record critical patient characteristics that, if overlooked, can increase the risk of adverse events and of miscommunication between healthcare professionals and patients. This study investigates the feasibility of using text-mining to extract patient characteristics from electronic health records (EHRs) and compares the effectiveness of text-mining against manual human review for identifying four patient characteristics: language barrier, living alone, cognitive frailty and non-adherence.

Methods: A manual gold standard was created from 1,120 patient files (878 patients) involving unplanned hospital readmissions. Each patient was categorized under one or more of the four characteristics, with supporting clinical notes extracted from their EHRs. Simple terminology was handled with a rule-based (RB) SQL query; complex terminology was handled with Named Entity Recognition (NER) models. Model performance was compared with the manual standard. The primary outcomes were recall, specificity, precision, negative predictive value (NPV) and F1-score.
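All five outcome measures follow from the counts of true and false positives and negatives obtained when model output is compared with the manual standard. The following is a minimal Python sketch of those standard definitions, not the study's own evaluation code: the names `Confusion`, `confusion_from_labels` and `compute_metrics` are hypothetical, and the toy labels are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Confusion:
    tp: int  # model flags the characteristic, manual standard agrees
    fp: int  # model flags it, manual standard does not
    tn: int  # neither flags it
    fn: int  # manual standard flags it, model misses it


def confusion_from_labels(gold: Sequence[int], pred: Sequence[int]) -> Confusion:
    """Count agreement between manual labels (gold) and model labels (pred)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return Confusion(tp, fp, tn, fn)


def _safe_div(num: float, den: float) -> float:
    # Return NaN instead of raising when a denominator is zero.
    return num / den if den else float("nan")


def compute_metrics(c: Confusion) -> Dict[str, float]:
    recall = _safe_div(c.tp, c.tp + c.fn)        # sensitivity
    specificity = _safe_div(c.tn, c.tn + c.fp)
    precision = _safe_div(c.tp, c.tp + c.fp)     # positive predictive value
    npv = _safe_div(c.tn, c.tn + c.fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
    }


# Invented toy labels (1 = characteristic present), not study data.
toy_gold = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_metrics = compute_metrics(confusion_from_labels(toy_gold, toy_pred))
```

Reporting specificity and NPV alongside recall and precision matters when a characteristic is rare, because a model can then score well on the negative class while still missing many true cases.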

Results: Performance for each patient characteristic was evaluated on a separate train/test dataset. An additional validation dataset was used for the NER models. On the train/test set, the language barrier RB query achieved a recall of 0.99 (specificity 0.96). The living alone NER model achieved a recall of 0.86 (specificity 0.94) on the train/test set and a recall of 0.81 (specificity 1.00) on the validation set. The cognitive frailty model yielded a recall of 0.59 (specificity 0.76) on the train/test set and a recall of 0.73 (specificity 0.96) on the validation set. The NER model for non-adherence achieved a recall of 0.75 (specificity 0.99) on the train/test set and a recall of 0.90 (specificity 0.99) on the validation set. The models tended to overestimate the presence of patient characteristics, for example by attributing a family member's language barrier to the patient.
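One way such over-identification can arise is that plain keyword matching has no notion of who a phrase refers to. The sketch below illustrates this with a Python regular expression standing in for a rule-based query; the study used SQL, and its actual search terms are not given in the abstract, so the term list, the function name `flag_language_barrier` and the example notes are all hypothetical.

```python
import re

# Hypothetical keyword list, assumed for illustration only.
LANGUAGE_BARRIER_TERMS = ["language barrier", "interpreter", "does not speak"]

# Case-insensitive alternation over the escaped terms.
PATTERN = re.compile(
    "|".join(re.escape(term) for term in LANGUAGE_BARRIER_TERMS),
    re.IGNORECASE,
)


def flag_language_barrier(note: str) -> bool:
    """Return True if any keyword occurs anywhere in the note."""
    return PATTERN.search(note) is not None


# Invented example notes, not taken from any EHR.
notes = [
    "Patient has a language barrier; interpreter arranged.",
    "Daughter has a language barrier, patient speaks Dutch fluently.",
    "Patient lives independently, no issues reported.",
]
flags = [flag_language_barrier(note) for note in notes]
```

The second note is flagged even though the barrier belongs to the daughter, which is a false positive of the kind described above. Resolving it requires knowing the subject of the phrase, which a keyword rule alone does not capture.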

Conclusion: This study demonstrated the feasibility of applying text-mining to identify patient characteristics from EHRs. For more complex terminology, NER models appear to outperform the rule-based approach. Future work involves refining these models for broader application in clinical settings.

Clinical trial number: Not applicable.

Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.