Design and Refinement of a Data Quality Assessment Workflow for a Large Pediatric Research Network

Ritu Khare, Levon H. Utidjian, H. Razzaghi, Victoria Soucek, Evanette K. Burrows, D. Eckrich, Richard Hoyt, Harris Weinstein, Matthew Miller, David Soler, Joshua Tucker, L. C. Bailey

EGEMS (Washington, DC), published 2019-08-01. DOI: 10.5334/EGEMS.294 (https://doi.org/10.5334/EGEMS.294)
Citations: 17
Abstract
Background: Clinical data research networks (CDRNs) aggregate electronic health record data from multiple hospitals to enable large-scale research. A critical operation in building a CDRN is conducting continual evaluations to optimize data quality. The key challenges include determining assessment coverage for large datasets, handling data variability over time, and facilitating communication with data teams. This study presents the evolution of a systematic workflow for data quality assessment in CDRNs.

Implementation: Using a specific CDRN as a use case, the workflow was iteratively developed and packaged into a toolkit. The resulting toolkit comprises 685 data quality checks to identify data quality issues, procedures to reconcile findings against a history of known issues, and a GitHub-based reporting mechanism for organized tracking.

Results: During the first two years of network development, the toolkit assisted in discovering over 800 data characteristics and resolving over 1,400 programming errors. Longitudinal analysis indicated that variability in time to resolution (mean 15 days, IQR 24 days) was attributable to the underlying cause of the issue, the perceived importance of the domain, and the complexity of the assessment.

Conclusions: In the absence of a formalized data quality framework, CDRNs continue to face challenges in data management and query fulfillment. The proposed data quality toolkit was empirically validated on a particular network and is publicly available for other networks. While the toolkit is user-friendly and effective, the usage statistics indicated that the data quality process is very time-intensive, and sufficient resources should be dedicated to investigating problems and optimizing data for research.
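The abstract describes the workflow only at a high level (rule-based checks, reconciliation with a history of known issues, and GitHub-based issue tracking). As a rough illustration of that general pattern, the Python sketch below is purely hypothetical: the check names, record format, and reconciliation key are assumptions for exposition and are not the published toolkit's API or code.

```python
# Hypothetical sketch of rule-based data quality checks plus reconciliation
# against a history of known issues. All names and formats are illustrative
# assumptions, not the published toolkit's implementation.
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    check_id: str   # e.g., "DEMO-001" for a demographics completeness check
    site: str       # contributing hospital/site
    detail: str     # human-readable description for the issue report


def check_missing_birth_dates(patients, site):
    """Flag patient records lacking a birth date (an illustrative check)."""
    missing = [p for p in patients if not p.get("birth_date")]
    if missing:
        return [Finding("DEMO-001", site,
                        f"{len(missing)} of {len(patients)} patients missing birth_date")]
    return []


def check_future_visit_dates(visits, site, today=None):
    """Flag visits dated in the future (an illustrative plausibility check)."""
    today = today or date.today()
    bad = [v for v in visits if v.get("visit_date") and v["visit_date"] > today]
    if bad:
        return [Finding("VISIT-002", site, f"{len(bad)} visits dated after {today}")]
    return []


def reconcile(findings, known_issues):
    """Split findings into new vs. already-tracked, mirroring the
    reconciliation step the abstract describes."""
    known_keys = {(k["check_id"], k["site"]) for k in known_issues}
    new = [f for f in findings if (f.check_id, f.site) not in known_keys]
    recurring = [f for f in findings if (f.check_id, f.site) in known_keys]
    return new, recurring


if __name__ == "__main__":
    patients = [{"birth_date": None}, {"birth_date": date(2010, 5, 1)}]
    visits = [{"visit_date": date(2030, 1, 1)}]
    findings = (check_missing_birth_dates(patients, "site_A")
                + check_future_visit_dates(visits, "site_A"))
    new, recurring = reconcile(findings,
                               known_issues=[{"check_id": "DEMO-001", "site": "site_A"}])
    # In a workflow like the one described, new findings would be filed as
    # issues in the GitHub-based tracker, while recurring ones are matched
    # to the existing history rather than re-reported.
    print("new:", new)
    print("recurring:", recurring)
```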