CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Xiao Yu Cindy Zhang, Carlos R Ferreira, Francis Rossignol, Raymond T Ng, Wyeth Wasserman, Jian Zhu

Proceedings of Machine Learning Research, vol. 287, pp. 527-542, June 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12477612/pdf/
Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured, predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction from case reports, with a focus on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel strategies of category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought offers little advantage over zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source Qwen2.5:7B outperforms GPT-4o on this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings for differential diagnosis. This work advances LLM-driven clinical NLP, paving the way for scalable medical AI applications.
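
To make the idea of category-specific prompting for dense extraction concrete, the following is a minimal sketch, not the authors' implementation: each predefined category gets its own prompt, and the per-category answers are assembled into a structured record. The category names, prompt wording, file name, and choice of the OpenAI chat API with "gpt-4o" are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of category-specific prompting for dense information extraction.
# Category names and prompt wording are hypothetical, not CaseReportBench's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical predefined extraction categories
CATEGORIES = [
    "history_of_present_illness",
    "laboratory_findings",
    "neurological_exam",
    "diagnosis",
]


def extract_category(case_report: str, category: str, model: str = "gpt-4o") -> str:
    """Prompt the model to extract only the information for a single category."""
    prompt = (
        f"From the clinical case report below, extract only the information "
        f"relevant to the category '{category}'. "
        f"Answer 'not reported' if the category is absent.\n\n{case_report}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


def dense_extract(case_report: str) -> dict:
    """Run one category-specific prompt per category and collect structured output."""
    return {cat: extract_category(case_report, cat) for cat in CATEGORIES}


if __name__ == "__main__":
    report = open("case_report.txt").read()  # placeholder input file
    print(json.dumps(dense_extract(report), indent=2))
```

In this sketch, issuing one focused prompt per category (rather than one prompt for all categories at once) is what the abstract refers to as category-specific prompting; swapping the model string for a locally served open-source model such as Qwen2.5:7B would follow the same pattern.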