LCD benchmark: long clinical document benchmark on mortality prediction for language models

WonJin Yoon, Shan Chen, Yanjun Gao, Zhanzhan Zhao, Dmitriy Dligach, Danielle S Bitterman, Majid Afshar, Timothy Miller

Journal of the American Medical Informatics Association, published 2024-11-27. DOI: 10.1093/jamia/ocae287
Abstract
Objectives: The application of natural language processing (NLP) in the clinical domain is important because of the rich unstructured information in clinical documents, which often remains inaccessible in structured data. When applying NLP methods to a given domain, benchmark datasets are crucial: they not only guide the selection of the best-performing models but also enable assessment of the reliability of the generated outputs. Despite the recent availability of language models capable of processing longer contexts, benchmark datasets targeting long clinical document classification tasks are absent.
Materials and methods: To address this issue, we propose the Long Clinical Document (LCD) benchmark, a benchmark for predicting 30-day out-of-hospital mortality from discharge notes in the Medical Information Mart for Intensive Care IV (MIMIC-IV) linked with statewide death data. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and convolutional neural networks to instruction-tuned large language models. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.
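To make the baseline setup concrete, the following is a minimal sketch of a bag-of-words classifier of the kind the abstract mentions. It is illustrative only: the paper's actual features, models, and hyperparameters are not specified in this abstract, and `train_notes`, `train_labels`, and `test_notes` are hypothetical placeholders for data derived from MIMIC-IV discharge notes.

```python
# Hedged sketch, not the authors' implementation: a TF-IDF bag-of-words
# baseline for 30-day out-of-hospital mortality prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical inputs: train_notes / test_notes are lists of discharge-note
# strings; train_labels uses 1 = death within 30 days of discharge, 0 = otherwise.
bow_baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
bow_baseline.fit(train_notes, train_labels)
predictions = bow_baseline.predict(test_notes)
```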
Results: The best-performing supervised baseline achieved an F1 score of 28.9%, and GPT-4 achieved 32.2%. Notes in our dataset have a median length of 1687 words.
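For reference, the reported F1 scores are the harmonic mean of precision and recall; we assume here that the single percentages refer to the positive (mortality) class of this binary task, which the abstract does not state explicitly. A minimal computation, reusing the hypothetical `test_labels` and `predictions` from the sketch above, might look like:

```python
from sklearn.metrics import f1_score

# Assumption: binary labels with 1 = 30-day mortality, and the reported
# F1 computed on the positive class (the scikit-learn default).
f1 = f1_score(test_labels, predictions)
print(f"F1: {f1:.1%}")
```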
Discussion: Our analysis of the model outputs showed that our dataset is challenging for both models and human experts, but that the models can extract meaningful signals from the text.
Conclusion: We expect our LCD benchmark to be a resource for the development of advanced supervised models or prompting methods tailored for clinical text.
Journal description:
JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.