Health Care Language Models and Their Fine-Tuning for Information Extraction: Scoping Review.

Impact Factor: 3.1 · CAS Tier 3 (Medicine) · JCR Q2 (Medical Informatics)
Miguel Nunes, Joao Bone, Joao C Ferreira, Luis B Elvas
DOI: 10.2196/60164 · JMIR Medical Informatics, vol 12, e60164 · Published 2024-10-21
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11535799/pdf/
Citations: 0

Abstract

Background: In response to the intricate language, specialized terminology rarely encountered in everyday life, and the frequent abbreviations and acronyms inherent in health care text data, domain adaptation techniques have become crucial for transformer-based models. This refinement of a language model's (LM's) knowledge allows for a better understanding of medical textual data, which improves performance on medical downstream tasks such as information extraction (IE). We identified a gap in the literature regarding health care LMs. Therefore, this study presents a scoping literature review investigating domain adaptation methods for transformers in health care, differentiating between English and non-English languages and focusing on Portuguese. More specifically, we investigated the development of health care LMs, with the aim of comparing Portuguese with other, more developed languages to guide the path of a non-English language with fewer resources.
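The vocabulary mismatch motivating domain adaptation can be illustrated with a toy example (not drawn from the review): a greedy WordPiece-style tokenizer over a general-domain vocabulary fragments a medical term into many sub-word pieces, while a domain-adapted vocabulary keeps it intact, giving the model a single meaningful unit to learn from. Both vocabularies below are hypothetical miniatures, not real BERT vocabularies.

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match-first sub-word tokenization (WordPiece style).

    Continuation pieces carry the conventional "##" prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches -> unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabularies
general_vocab = {"my", "##o", "##card", "##ial"}
medical_vocab = general_vocab | {"myocardial"}

general_pieces = wordpiece_tokenize("myocardial", general_vocab)  # fragmented
medical_pieces = wordpiece_tokenize("myocardial", medical_vocab)  # kept whole
```

Under the general vocabulary the term splits into four pieces; the domain-adapted vocabulary returns it as one token, which is one intuition behind why continued pretraining on clinical text improves downstream IE.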

Objective: This study aimed to survey health care IE models, regardless of language, to understand the efficacy of transformers and which medical entities are most commonly extracted.

Methods: This scoping review was conducted using the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) methodology on the Scopus and Web of Science Core Collection databases. Only studies that described the creation of health care LMs or health care IE models were included; studies on large language models (LLMs) were excluded, since we aimed to research LMs rather than LLMs, which are architecturally different and serve distinct purposes.

Results: Our search query retrieved 137 studies, 60 of which met the inclusion criteria; none were systematic literature reviews. English and Chinese are the languages with the most health care LMs developed. These languages already have disease-specific LMs, while others have only general health care LMs. European Portuguese does not have any public health care LM and should take examples from other languages to develop, first, general health care LMs and then, in an advanced phase, disease-specific LMs. Regarding IE models, transformers were the most commonly used method, and named entity recognition was the most popular task, with only a few studies addressing assertion status or medical lexical problems. The most frequently extracted entities were diagnosis, posology, and symptoms.
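Named entity recognition models of the kind surveyed here are typically trained on BIO-labeled token sequences. The sketch below shows that labeling scheme; the entity types mirror the most-extracted entities reported above (diagnosis, posology, symptoms), but the sentence, spans, and helper function are invented for illustration.

```python
def to_bio(tokens: list, spans: list) -> list:
    """Convert entity spans to BIO labels.

    spans: list of (start_idx, end_idx_exclusive, label) over token indices.
    Tokens outside any span are labeled "O"; the first token of an entity
    gets "B-<label>" and subsequent tokens get "I-<label>".
    """
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        labels[start] = f"B-{label}"
        for i in range(start + 1, end):
            labels[i] = f"I-{label}"
    return labels

tokens = ["Patient", "reports", "chest", "pain", ";", "suspected",
          "myocardial", "infarction", "."]
spans = [(2, 4, "SYMPTOM"), (6, 8, "DIAGNOSIS")]
labels = to_bio(tokens, spans)
# e.g. "chest" -> B-SYMPTOM, "pain" -> I-SYMPTOM
```

A transformer NER model is then trained to predict one such label per token, which is how the reviewed systems recover diagnoses, posology, and symptoms from free text.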

Conclusions: The findings indicate that domain adaptation is beneficial, achieving better results in downstream tasks. Our analysis shows that the use of transformers is more developed for English and Chinese. European Portuguese lacks relevant studies and should draw on examples from other non-English languages to develop these models and drive progress in AI. Health care professionals could benefit from the highlighting of medically relevant information and faster reading of textual data, and the extracted information could be used to create patient medical timelines, allowing for profiling.
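The "patient medical timeline" idea from the conclusions amounts to ordering dated IE outputs chronologically. A minimal sketch, assuming each extraction carries a date, an entity type, and its text (the records and field layout below are hypothetical; a real pipeline would take them from the extraction model):

```python
from datetime import date

def build_timeline(extractions: list) -> list:
    """Sort (date, entity_type, text) extraction tuples chronologically."""
    return sorted(extractions, key=lambda e: e[0])

# Hypothetical IE outputs gathered from several clinical notes
notes = [
    (date(2023, 5, 2), "DIAGNOSIS", "type 2 diabetes"),
    (date(2021, 1, 15), "SYMPTOM", "polyuria"),
    (date(2022, 11, 30), "POSOLOGY", "metformin 500 mg twice daily"),
]
timeline = build_timeline(notes)  # earliest event first
```

Ordering the entities this way yields the profiling view the authors describe: symptom onset, then treatment, then diagnosis, at a glance.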

Source journal

JMIR Medical Informatics (Medicine – Health Informatics)
CiteScore: 7.90 · Self-citation rate: 3.10% · Articles per year: 173 · Review time: 12 weeks
Journal description: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, eHealth infrastructures, and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers that are more technical or more formative than what would be published in the Journal of Medical Internet Research.