CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Xiao Yu Cindy Zhang, Carlos R Ferreira, Francis Rossignol, Raymond T Ng, Wyeth Wasserman, Jian Zhu
{"title":"CaseReportBench:用于临床病例报告密集信息提取的LLM基准数据集。","authors":"Xiao Yu Cindy Zhang, Carlos R Ferreira, Francis Rossignol, Raymond T Ng, Wyeth Wasserman, Jian Zhu","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce <b>CaseReportBench</b>, an expert-annotated dataset for dense information extraction of case reports (focusing on IEMs). Using this dataset, we assess various models and promptings, introducing novel strategies of <b>category-specific prompting</b> and <b>subheading-filtered data integration</b>. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. <b>Category-specific prompting</b> improves alignment to benchmark. Open-source <b>Qwen2.5:7B</b> outperforms <b>GPT-4o</b> for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable medical AI applications.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"287 ","pages":"527-542"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12477612/pdf/","citationCount":"0","resultStr":"{\"title\":\"CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports.\",\"authors\":\"Xiao Yu Cindy Zhang, Carlos R Ferreira, Francis Rossignol, Raymond T Ng, Wyeth Wasserman, Jian Zhu\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce <b>CaseReportBench</b>, an expert-annotated dataset for dense information extraction of case reports (focusing on IEMs). Using this dataset, we assess various models and promptings, introducing novel strategies of <b>category-specific prompting</b> and <b>subheading-filtered data integration</b>. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. <b>Category-specific prompting</b> improves alignment to benchmark. Open-source <b>Qwen2.5:7B</b> outperforms <b>GPT-4o</b> for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings for differential diagnosis. 
This work advances LLM-driven clinical NLP, paving the way for scalable medical AI applications.</p>\",\"PeriodicalId\":74504,\"journal\":{\"name\":\"Proceedings of machine learning research\",\"volume\":\"287 \",\"pages\":\"527-542\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12477612/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of machine learning research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract


Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports (focusing on IEMs). Using this dataset, we assess various models and prompting strategies, introducing novel strategies of category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. Category-specific prompting improves alignment with the benchmark. Open-source Qwen2.5:7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable medical AI applications.
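The central idea of category-specific prompting is to query the model once per predefined category rather than asking for all structured fields in a single pass. The sketch below illustrates that pattern only; the category names, prompt wording, and the `call_llm` stub are illustrative assumptions, not the actual prompts, schema, or evaluation pipeline used in CaseReportBench.

```python
# Minimal sketch of category-specific prompting for dense extraction from a
# clinical case report. Categories and prompt text are assumed for illustration;
# `call_llm` is a placeholder to be swapped for any chat-completion client
# (e.g., GPT-4o or Qwen2.5:7B served locally).
import json

CATEGORIES = ["presenting symptoms", "biochemical findings",
              "genetic findings", "treatment", "outcome"]  # assumed examples

def build_prompt(category: str, report_text: str) -> str:
    """Build one focused prompt per predefined category."""
    return (
        f"Extract all details about '{category}' from the case report below. "
        "Return a JSON list of short phrases; return [] if nothing is reported.\n\n"
        f"Case report:\n{report_text}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; always returns an empty JSON list here."""
    return "[]"

def extract_dense(report_text: str) -> dict:
    """Query the model once per category and assemble a structured record."""
    record = {}
    for category in CATEGORIES:
        raw = call_llm(build_prompt(category, report_text))
        try:
            record[category] = json.loads(raw)
        except json.JSONDecodeError:
            record[category] = []  # tolerate malformed model output
    return record

if __name__ == "__main__":
    print(extract_dense("A 3-year-old presented with hypoglycemia and hepatomegaly."))
```

The per-category loop trades extra model calls for narrower, easier-to-follow instructions, which is the motivation the abstract gives for category-specific prompting over a single monolithic extraction prompt.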
