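Extracting Multifaceted Characteristics of Patients With Chronic Disease Comorbidity: Framework Development Using Large Language Models.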

Impact Factor 3.1; Zone 3 (Medicine); JCR Q2 (Medical Informatics)

JMIR Medical Informatics Pub Date : 2025-05-15 DOI:10.2196/70096

Junyan Zhang, Junchen Zhou, Liqin Zhou, Zhichao Ba


Citations: 0

Abstract


Extracting Multifaceted Characteristics of Patients With Chronic Disease Comorbidity: Framework Development Using Large Language Models.

Background: Research on chronic multimorbidity has increasingly become a focal point with the aging of the population. Many studies in this area require detailed patient characteristic information. However, the current methods for extracting such information are complex, time-consuming, and prone to errors. The challenge of quickly and accurately extracting patient characteristics has become a common issue in the study of chronic disease comorbidities.

Objective: Our objective was to establish a comprehensive framework for extracting demographic and disease characteristics of patients with multimorbidity. This framework leverages large language models (LLMs) to extract feature information from unstructured and semistructured electronic health records pertaining to these patients. We investigated the model's proficiency in extracting feature information across 7 dimensions: basic information, disease details, lifestyle habits, family medical history, symptom history, medication recommendations, and dietary advice. In addition, we demonstrated the strengths and limitations of this framework.

Methods: We used data sourced from a grassroots community health service center in China. We developed a multifaceted feature extraction framework tailored for patients with multimorbidity, which consists of several integral components: feasibility testing, preprocessing, the determination of feature extraction, prompt modeling based on LLMs, postprocessing, and midterm evaluation. Within this framework, 7 types of feature information were extracted as straightforward features, and 3 types of features were identified as intricate features. On the basis of the straightforward features, we calculated patients' age, BMI, and 12 disease risk factors. Rigorous manual verification experiments were conducted 100 times for straightforward features and 200 times for intricate features, followed by comprehensive quantitative and qualitative assessments of the experimental outcomes.
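As an illustration of the derived-feature step mentioned in the Methods, the following minimal sketch computes age and BMI from already-extracted fields. It is not the authors' implementation: the record layout, the field names (`birth_date`, `weight_kg`, `height_cm`), the reference date, and the sample values are all assumptions made for the example. Only the standard definitions of completed age and BMI (weight in kilograms divided by the square of height in meters) are relied on.

```python
from datetime import date


def age_in_years(birth: date, on: date) -> int:
    """Completed years between a birth date and a reference date."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Hypothetical record, shaped like the output of an extraction step.
# The values are invented for illustration, not taken from the study.
record = {
    "birth_date": date(1950, 6, 15),
    "weight_kg": 70.0,
    "height_cm": 175.0,
}

derived = {
    "age": age_in_years(record["birth_date"], date(2025, 5, 15)),
    "bmi": round(bmi(record["weight_kg"], record["height_cm"]), 1),
}
print(derived)
```

Keeping the derivation in plain functions, separate from the LLM extraction, means arithmetic of this kind can be checked deterministically regardless of how the raw fields were obtained.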

Results: The framework achieved an overall F₁-score of 99.6% for the 7 straightforward feature extractions, with the highest F₁-score of 100% for basic information. In addition, the framework demonstrated an overall F₁-score of 94.4% for the 3 intricate feature extractions. Our analysis of the results revealed that accurate information content extraction is a substantial advantage of this framework, whereas ensuring consistency in the format of extracted information remains one of its challenges.
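For readers unfamiliar with the metric, the F₁-score is the harmonic mean of precision and recall. The sketch below shows the standard computation from counts of true positives, false positives, and false negatives. The counts are invented for illustration and are not the study's data, and the abstract does not state how the "overall" score was aggregated across feature types, so no aggregation is attempted here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (hypothetical, not from the paper):
# 90 correctly extracted values, 5 spurious, 10 missed.
score = f1_score(tp=90, fp=5, fn=10)
print(f"F1 = {score:.3f}")
```

With these counts, precision is 90/95 and recall is 90/100, so the score equals 180/195. The same function applies per feature type once manually verified extractions are tallied into the three counts.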

Conclusions: The framework incorporates electronic health record information from 1225 patients with multimorbidity, covering a diverse range of 41 chronic diseases, and can seamlessly accommodate the inclusion of additional diseases. This underscores its scalability and adaptability as a method for extracting patient-specific characteristics, effectively addressing the challenges associated with information extraction in the context of multidisease research. Research and medical policy personnel can extract feature information by setting corresponding goals based on the research objectives and directly using the LLM for zero-shot target feature extraction. This approach greatly improves research efficiency and reduces labor requirements; moreover, due to the framework's high accuracy, it can increase study reliability.


Source journal

JMIR Medical Informatics (Medicine: Health Informatics)

CiteScore: 7.90
Self-citation rate: 3.10%
Articles published: 173
Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It focuses on applied, translational research and has a broad readership including clinicians, CIOs, engineers, and industry and health informatics professionals. It is published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175). JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers and citizens, who are the focus of JMIR. It also publishes even faster and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.