Development and validation of the provider documentation summarization quality instrument for large language models.

IF 4.7 | CAS Tier 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris Ebby, Jillian Gorski, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar
Journal of the American Medical Informatics Association, pages 1050-1060.
DOI: 10.1093/jamia/ocaf068
Publication date: 2025-06-01 (Journal Article)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12089781/pdf/
Citations: 0

Abstract

Objectives: As large language models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation and as models and documentation practices evolve. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. This study aimed to validate the PDSQI-9 across key aspects of construct validity.

Materials and methods: Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation analyses for substantive validity, factor analysis and Cronbach's α for structural validity, inter-rater reliability (ICC and Krippendorff's α) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Raters underwent standardized training to ensure consistent application of the instrument.
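As an illustration of one of the reliability statistics named above, the sketch below computes Cronbach's α from an item-score matrix. The function and the toy ratings are assumptions for illustration only, not the study's data or code.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects x n_items) score matrix.

    scores[i, j] = rating of subject i on item j.
    """
    scores = np.asarray(scores, dtype=float)
    n_subjects, n_items = scores.shape
    item_vars = scores.var(axis=0, ddof=1)      # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of per-subject totals
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical example: 6 summaries scored on 4 items
ratings = np.array([
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 5, 4, 4],
])
print(round(cronbach_alpha(ratings), 3))  # → 0.956
```

Values near 1 indicate that the instrument's items move together across subjects, which is the internal-consistency property the study reports.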

Results: Seven physician raters evaluated 779 summaries and answered 8329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach's α = 0.879; 95% CI, 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI, 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (ρ = -0.200, P = .029) and Organized (ρ = -0.190, P = .037). The semi-Delphi process ensured clinically relevant attributes, and discriminant validity distinguished high- from low-quality summaries (P<.001).
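The ICC reported above can be obtained from a two-way ANOVA decomposition; the abstract does not specify which ICC form the authors used, so the sketch below shows ICC(2,1) (two-way random effects, absolute agreement, single rater) as one common choice, on toy data rather than the study's ratings.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i, j] = score given by rater j to subject i.
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_subj = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between-subject SS
    ss_rater = n * ((Y.mean(axis=0) - grand) ** 2).sum()  # between-rater SS
    ss_err = ((Y - grand) ** 2).sum() - ss_subj - ss_rater
    msr = ss_subj / (n - 1)             # mean square, rows (subjects)
    msc = ss_rater / (k - 1)            # mean square, columns (raters)
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 4 summaries scored by 3 raters
ratings = np.array([
    [4, 4, 5],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 4],
])
print(round(icc2_1(ratings), 3))  # → 0.75
```

Perfect agreement among raters yields an ICC of 1; the study's reported ICC of 0.867 indicates high, though not perfect, agreement.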

Discussion: The PDSQI-9 showed high inter-rater reliability, internal consistency, and a meaningful factor structure that reliably captured key dimensions of documentation quality. It distinguished between high- and low-quality summaries, supporting its practical utility for health systems needing an evaluation instrument for LLMs.

Conclusions: The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer, more effective integration of LLMs into healthcare workflows.

Source journal
Journal of the American Medical Informatics Association
Category: Medicine / Computer Science: Interdisciplinary Applications
CiteScore: 14.50
Self-citation rate: 7.80%
Articles per year: 230
Review time: 3-8 weeks
Journal overview: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.