Development and validation of the provider documentation summarization quality instrument for large language models.

IF 4.6 | Zone 2, Medicine | Q1 | COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of the American Medical Informatics Association | Pub Date: 2025-06-01 | DOI: 10.1093/jamia/ocaf068

Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris Ebby, Jillian Gorski, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar

Pages: 1050-1060 | Publication type: Journal Article | PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12089781/pdf/

Citation count: 0

Abstract

Objectives: As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

Materials and methods: Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.
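As an illustration of the internal-consistency statistic named above, the following is a minimal sketch of Cronbach's α computed from a summaries-by-items score table. The function name `cronbach_alpha` and the toy scores are hypothetical and are not taken from the study; the sketch uses the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total scores) with sample variances, and it does not reproduce the authors' analysis or their confidence intervals.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per rated summary
    and one column per instrument item (hypothetical layout)."""
    if len(rows) < 2:
        raise ValueError("need at least two rated summaries")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(r) != k for r in rows):
        raise ValueError("every row must have the same number of items")

    # Sample variance of each item (column) across summaries.
    item_vars = [variance([float(r[j]) for r in rows]) for j in range(k)]
    # Sample variance of the per-summary total score.
    total_var = variance([float(sum(r)) for r in rows])

    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Toy Likert-style scores, invented purely for illustration.
    toy_scores = [
        [4, 5, 4],
        [2, 3, 2],
        [5, 5, 4],
        [3, 3, 3],
        [1, 2, 2],
    ]
    print(round(cronbach_alpha(toy_scores), 3))
```

Values near 1 indicate that the items move together across summaries; the total-score variance must be nonzero for the ratio to be defined.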

Results: Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P<.001).
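The length-versus-score relationship reported above can be illustrated with a plain Pearson correlation, the method the abstract names for substantive validity. This is a minimal sketch: the function name `pearson_r` and the sample values are hypothetical and do not come from the study data, and no P value is computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Centered cross-product and sums of squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented note lengths (in words) and Succinct scores, for illustration only.
    note_lengths = [1200, 800, 2500, 400, 1800]
    succinct_scores = [3, 4, 2, 5, 3]
    print(round(pearson_r(note_lengths, succinct_scores), 3))
```

A negative coefficient, as in the reported results, means longer notes tend to receive lower scores on the attribute in question.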

Discussion: The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

Conclusions: The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.


Source journal

Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks

About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.