Optimizing the efficiency and effectiveness of data quality assurance in a multicenter clinical dataset.

IF 4.6 · Zone 2 (Medicine) · Q1 Computer Science, Information Systems

Journal of the American Medical Informatics Association · Pub Date: 2025-05-01 · DOI: 10.1093/jamia/ocaf042

Anne Fu, Trong Shen, Surain B Roberts, Weihan Liu, Shruthi Vaidyanathan, Kayley-Jasmin Marchena-Romero, Yuen Yu Phyllis Lam, Kieran Shah, Denise Y F Mak, Fahad Razak, Amol A Verma

Pages: 835-844 · Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012372/pdf/ · Source: https://doi.org/10.1093/jamia/ocaf042

Citations: 0

Abstract


Optimizing the efficiency and effectiveness of data quality assurance in a multicenter clinical dataset.

Objectives: Electronic health record (EHR) data are increasingly used for research and analysis, but there is little empirical evidence on how automated and manual assessments can be combined to assess data quality efficiently in large EHR repositories.

Materials and methods: The GEMINI database collected data from 462 226 patient admissions across 32 hospitals from 2021 to 2023. We report data quality issues identified through semi-automated and manual data quality assessments completed during the data collection phase. We conducted a simulation experiment to evaluate the relationship between the number of records reviewed manually, the detection of true data errors (true positives) and the number of manual chart abstraction errors (false positives) that required unnecessary investigation.

Results: The semi-automated data quality assessments identified 79 data quality issues requiring correction, of which 14 had a large impact, affecting at least 50% of records in the data. After resolving issues identified through semi-automated assessments, manual validation of 2676 patient encounters at 19 hospitals identified 4 new meaningful data errors (3 in transfusion data and 1 in physician identifiers), distributed across 4 hospitals. There were 365 manual chart abstraction errors, which required investigation by data analysts to identify as "false positives." These errors increased linearly with the number of charts reviewed manually. Simulation results demonstrate that all 3 transfusion data errors were identified with 95% sensitivity after manual review of 5 records, whereas 18 records were needed for the physician's table.

Discussion and conclusion: The GEMINI approach represents a scalable framework for data quality assessment and improvement in multisite EHR research databases. Manual data review is important but can be minimized to optimize the trade-off between true and false identification of data quality errors.
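
The trade-off described in the Results can be illustrated with a simple sampling model. This is a minimal sketch and not the authors' simulation: it assumes records are reviewed independently and that a given data error affects a fixed fraction of records (`prevalence`, a hypothetical parameter not reported in the abstract). The false-positive rate is taken from the abstract's totals (365 abstraction errors over 2676 encounters), and treating it as a constant per-chart rate is an assumption suggested by the reported linear trend.

```python
import math

# Totals reported in the abstract.
ABSTRACTION_ERRORS = 365
ENCOUNTERS_REVIEWED = 2676
# Assumed constant per-chart false-positive rate (about 0.136).
FALSE_POSITIVE_RATE = ABSTRACTION_ERRORS / ENCOUNTERS_REVIEWED


def detection_probability(prevalence: float, n_reviewed: int) -> float:
    """Chance that at least one of n independently sampled records shows the error."""
    return 1.0 - (1.0 - prevalence) ** n_reviewed


def records_needed(prevalence: float, target: float = 0.95) -> int:
    """Smallest number of reviewed records whose detection probability reaches target."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if prevalence == 1.0:
        return 1  # every record shows the error
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - prevalence))


def expected_false_positives(n_reviewed: int) -> float:
    """Expected abstraction errors to investigate, growing linearly with charts reviewed."""
    return FALSE_POSITIVE_RATE * n_reviewed


if __name__ == "__main__":
    # Hypothetical prevalences, chosen only for illustration.
    for prevalence in (0.5, 0.25, 0.1):
        n = records_needed(prevalence)
        print(prevalence, n, round(expected_false_positives(n), 2))
```

Under this model, an error that affects a larger share of records is found after very few reviews, while the expected number of false positives to investigate keeps rising with every additional chart. That is the reason for keeping manual review small.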


Source journal

Journal of the American Medical Informatics Association · Medicine; Computer Science, Interdisciplinary Applications

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks

About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.