MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain.

Impact factor: 2.4 | JCR Q3 | Computer Science, Information Systems

Frontiers in Big Data | Published: 2024-06-26 | eCollection date: 2024-01-01 | DOI: 10.3389/fdata.2024.1371680

Alaa Marshan, Anwar Nais Almutairi, Athina Ioannou, David Bell, Asmat Monaghan, Mahir Arzoky

{"title":"MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain.","authors":"Alaa Marshan, Anwar Nais Almutairi, Athina Ioannou, David Bell, Asmat Monaghan, Mahir Arzoky","doi":"10.3389/fdata.2024.1371680","DOIUrl":null,"url":null,"abstract":"Introduction: In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs.Methods: To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model on two optimizers, varying numbers of training epochs, and using two datasets, MIMICSQL and WikiSQL.Results: For MIMICSQL dataset, the model demonstrates considerable effectiveness in generating question-SQL pairs achieving accuracy of 80.63%, 98.937%, and 90% for exact match accuracy matrix, approximate string-matching, and manual evaluation, respectively. When testing the performance of the model on WikiSQL dataset, the model demonstrates efficiency in generating SQL queries, with an accuracy of 44.2% on WikiSQL and 94.26% for approximate string-matching.Discussion: Results indicate improved performance with increased training epochs. 
This work highlights the potential of fine-tuned T5 model to convert medical-related questions written in natural language to Structured Query Language (SQL) in healthcare domain, providing a foundation for future research in this area.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1371680"},"PeriodicalIF":2.4000,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11233734/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2024.1371680","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Introduction: As electronic medical records (EMRs) are increasingly stored in databases, healthcare staff with limited technical expertise in database operations are encountering difficulties retrieving them. Because these records are crucial for delivering appropriate medical care, healthcare staff need an accessible method for retrieving EMRs.

Methods: To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries from natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model, designed specifically for EMR retrieval. The proposed model builds on the Text-to-Text Transfer Transformer (T5), a Large Language Model (LLM) commonly used in text-based NLP tasks. The model is fine-tuned on MIMICSQL, the first Text-to-SQL dataset for the healthcare domain. Performance is evaluated by benchmarking MedT5SQL with two optimizers, varying numbers of training epochs, and two datasets, MIMICSQL and WikiSQL.
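To make the text-to-text framing concrete, the following is a minimal sketch of how a question-SQL pair could be serialized into a source string and a target string for fine-tuning a T5-style model. The task prefix, the `build_example` helper, the `| columns:` separator, and the table and column names are all hypothetical illustrations; the abstract does not describe the actual input format used for MedT5SQL.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """One text-to-text training pair: model input and expected output."""
    source: str
    target: str


def build_example(
    question: str,
    sql: str,
    columns: Sequence[str],
    prefix: str = "translate to SQL: ",  # hypothetical task prefix
) -> Example:
    """Serialize a question, its schema columns, and its SQL into plain text.

    Assumption: the schema is appended to the question as a flat column list.
    This is one common convention, not necessarily the one used in the paper.
    """
    schema = ", ".join(columns)
    source = f"{prefix}{question.strip()} | columns: {schema}"
    return Example(source=source, target=sql.strip())


# Hypothetical question and schema, for illustration only.
example = build_example(
    question="How many patients are older than 60?",
    sql="SELECT COUNT(*) FROM patients WHERE age > 60",
    columns=["patients.id", "patients.age"],
)
print(example.source)
print(example.target)
```

Because both sides are plain strings, the same encoder-decoder model and training loop can in principle be reused across datasets such as MIMICSQL and WikiSQL by changing only the serialization step.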

Results: On the MIMICSQL dataset, the model is effective at generating SQL queries from questions, achieving 80.63%, 98.937%, and 90% on the exact-match accuracy metric, approximate string-matching, and manual evaluation, respectively. On the WikiSQL dataset, the model achieves an accuracy of 44.2% and an approximate string-matching score of 94.26%.
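The gap between the exact-match and approximate string-matching figures is easier to interpret with a small worked example. The sketch below computes both kinds of score for a list of predicted and reference queries. It assumes whitespace- and case-insensitive comparison and uses Python's `difflib.SequenceMatcher` ratio as the similarity measure; the abstract does not say which normalization or string-matching algorithm the authors used, so the helper names and the sample queries are illustrative only.

```python
from difflib import SequenceMatcher
from typing import Sequence


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace (an assumption)."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def mean_string_similarity(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Average character-level similarity in [0, 1] using difflib's ratio."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    scores = [
        SequenceMatcher(None, normalize(p), normalize(r)).ratio()
        for p, r in zip(predictions, references)
    ]
    return sum(scores) / len(scores)


# Hypothetical queries: the second prediction differs by a single character.
preds = ["SELECT name FROM patients WHERE age > 60;", "select count(*) from labs"]
refs = ["select name  from patients where age > 60", "select count(*) from lab"]

print(exact_match_accuracy(preds, refs))
print(mean_string_similarity(preds, refs))
```

A query that is wrong by one token counts as a complete miss under exact match but still scores close to 1 under string similarity, which is why the second kind of figure is typically much higher and says less about whether the query would execute correctly.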

Discussion: Results indicate that performance improves with more training epochs. This work highlights the potential of a fine-tuned T5 model to convert medical questions written in natural language into Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
