The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data.

IF 3.3 | Zone 3 (Medicine) | Q2 MEDICAL INFORMATICS
Juan G Diaz Ochoa, Faizan E Mustafa, Felix Weil, Yi Wang, Kudret Kama, Markus Knott
Journal: BMC Medical Informatics and Decision Making, 24(1):409
Publication date: 2024-12-28
Publication type: Journal Article
DOI: https://doi.org/10.1186/s12911-024-02825-4
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681671/pdf/
Citations: 0

Abstract

The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data.

Background: Medical narratives are fundamental to the correct identification of a patient's health condition, not only because they describe the patient's situation but also because they contain relevant information about the patient's context and the evolution of their health state. Narratives are usually vague and cannot be categorized easily. However, once the patient's situation has been correctly identified from a narrative, it can be mapped onto precise, machine-readable classification schemas and ontologies. To this end, language models can be trained to read and extract elements from these narratives. The main problem is the lack of data for model identification and model training in languages other than English. First, gold standard annotations are usually not available because of the high level of protection applied to patient data. Second, gold standard annotations, where they exist, are difficult to access. Alternative available data, such as MIMIC (Sci Data 3:1, 2016), are written in English and cover specific patient conditions such as intensive care. Thus, when model training is required for other types of patients, such as oncology rather than intensive care patients, this could lead to bias. To facilitate the training of clinical narrative models, a method for creating high-quality synthetic narratives is needed.

Method: We devised workflows based on generative AI methods to synthesize narratives in German, avoiding the disclosure of patients' health data. Since we required highly realistic narratives, we generated prompts, written with high-quality medical terminology, that ask for clinical narratives containing both a main disease and a co-disease. The frequency distribution of main diseases and co-diseases was extracted from the hospital's structured data, so that the synthetic narratives reflect the disease distribution in the patient cohort. To validate the quality of the synthetic narratives, we annotated them and used them to train a Named Entity Recognition (NER) algorithm. Under our assumptions, successful validation of this system implies that the synthesized data used for its training are of acceptable quality.
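
The sampling step described in the method can be illustrated with a minimal sketch. It is not the authors' workflow: the disease codes, the counts in `PAIR_COUNTS`, the German prompt wording in `PROMPT_TEMPLATE`, and the function names are all hypothetical placeholders chosen for illustration. The sketch only shows how prompts could be drawn so that main disease and co-disease pairs appear in proportion to their frequency in structured data.

```python
import random
from collections import Counter

# Hypothetical (main disease, co-disease) counts. These are illustrative
# placeholders, NOT data from the paper or from any hospital.
PAIR_COUNTS = {
    ("C50.9", "I10"): 6,
    ("C34.1", "J44.9"): 3,
    ("C18.7", "E11.9"): 1,
}

# Hypothetical prompt template, written in German because the synthetic
# narratives are German. The wording is an assumption.
PROMPT_TEMPLATE = (
    "Schreibe einen realistischen Arztbrief auf Deutsch für einen Patienten "
    "mit Hauptdiagnose {main} und Nebendiagnose {co}."
)


def sample_pairs(pair_counts, n, seed=0):
    """Draw n disease pairs with probability proportional to their counts."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=n)


def build_prompts(pair_counts, n, seed=0):
    """Turn sampled disease pairs into prompts for a generative model."""
    return [
        PROMPT_TEMPLATE.format(main=main, co=co)
        for main, co in sample_pairs(pair_counts, n, seed)
    ]


if __name__ == "__main__":
    for prompt in build_prompts(PAIR_COUNTS, 3):
        print(prompt)
    # Empirical frequencies of a larger sample approximate the input weights.
    print(Counter(sample_pairs(PAIR_COUNTS, 1000)).most_common())
```

Weighted sampling with replacement is used here so that frequent disease combinations dominate the synthetic corpus in the same way they dominate the cohort; sending the prompts to a model and annotating the output are outside the scope of this sketch.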

Result: We report precision, recall, and F1 score for the NER model, considering metrics that take into account both exact and partial entity matches. The trained models are cautious, with a precision of up to 0.8 for the Entity Type match metric and an F1 score of 0.3.
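
To make the difference between exact and partial matching concrete, the following is a simplified sketch, not the evaluation code used in the paper. It assumes entities are represented as `(start, end, label)` tuples, that an "exact" match requires identical boundaries and label, and that an "entity type" match requires the same label with overlapping spans. The labels and offsets in `GOLD` and `PRED` are made up for illustration.

```python
from typing import Callable, List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset exclusive, label)

# Hypothetical annotations for one document, for illustration only.
GOLD: List[Entity] = [(0, 8, "DISEASE"), (20, 30, "DISEASE"), (40, 47, "MEDICATION")]
PRED: List[Entity] = [(0, 8, "DISEASE"), (22, 30, "DISEASE")]


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two character spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exact_match(gold: Entity, pred: Entity) -> bool:
    """Same boundaries and same label."""
    return gold == pred


def type_match(gold: Entity, pred: Entity) -> bool:
    """Same label and overlapping spans (boundaries may differ)."""
    return gold[2] == pred[2] and overlaps(gold, pred)


def prf(gold: List[Entity], pred: List[Entity],
        match: Callable[[Entity, Entity], bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 with greedy one-to-one matching."""
    unmatched = list(gold)
    true_positives = 0
    for p in pred:
        for g in unmatched:
            if match(g, p):
                unmatched.remove(g)  # each gold entity is matched at most once
                true_positives += 1
                break
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print("exact:", prf(GOLD, PRED, exact_match))
    print("type: ", prf(GOLD, PRED, type_match))
```

In this toy example the second prediction has the right label but a shifted start offset, so it counts under the entity type criterion and not under the exact criterion. A model that predicts few entities but gets most of them right shows the same pattern of high precision and a lower F1 score.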

Conclusion: Despite its inherent limitations, this technology has the potential to enable data interoperability by using encoded diseases across languages and regions without compromising data safety. It also facilitates the synthesis of unstructured patient data, which can accelerate the identification and training of models. We believe that this method may be able to generate discharge letters for any combination of main and co-diseases, which would significantly reduce the time healthcare professionals spend writing these letters.

Source journal
CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month
Journal description: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.