Tracking provenance in clinical data warehouses for quality management

IF 3.7 | Zone 2 (Medicine) | JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS
Marco Johns, Lena Baum, Fabian Prasser
International Journal of Medical Informatics, vol. 193, Article 105690. Published 2024-11-02.
DOI: 10.1016/j.ijmedinf.2024.105690
https://www.sciencedirect.com/science/article/pii/S1386505624003538

Abstract

Introduction

Data provenance, which documents the origin, history, and transformations of data, can enhance the reproducibility of processing workflows and help to address errors and quality issues. In this work, we focus on tracking and utilizing provenance information as part of quality management in Extract-Transform-Load (ETL) processes used to build clinical data warehouses.

Methods

We designed and implemented a framework that automatically tracks how data flows through an ETL process and detects errors and quality problems during processing. Detected issues are reported to an Application Programming Interface (API), which stores them together with contextual information on their location within the data being transformed and within the overall workflow. We further designed a dashboard that supports health data engineers in inspecting the encountered issues and tracing them back to their root causes.
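
The reporting idea described above can be illustrated with a minimal Java sketch. It is not the authors' implementation: the names `ProvenanceIssue`, `IssueReporter`, `InMemoryReporter`, and `ParseValueStep`, the record fields, and the sample file name are all assumptions made for illustration, and an in-memory list stands in for the provenance API so that the example runs without a network. Java 17 or later is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical issue record: field names are illustrative, not the paper's schema.
record ProvenanceIssue(String source, String recordId, String field, String step, String message) {}

// Stand-in for the provenance API; a real client would send each issue to a remote service.
interface IssueReporter {
    void report(ProvenanceIssue issue);
}

// In-memory reporter so the sketch runs without a network.
class InMemoryReporter implements IssueReporter {
    private final List<ProvenanceIssue> issues = new ArrayList<>();

    @Override
    public void report(ProvenanceIssue issue) {
        issues.add(issue);
    }

    List<ProvenanceIssue> issues() {
        return List.copyOf(issues);
    }
}

// One transform step: parse a numeric value and report a located issue when parsing fails.
class ParseValueStep {
    private static final String STEP = "parse-lab-value"; // hypothetical step name
    private final IssueReporter reporter;

    ParseValueStep(IssueReporter reporter) {
        this.reporter = reporter;
    }

    Optional<Double> apply(String source, Map<String, String> row) {
        String raw = row.get("value");
        if (raw != null) {
            try {
                return Optional.of(Double.parseDouble(raw.trim()));
            } catch (NumberFormatException e) {
                // fall through and report the issue below
            }
        }
        // Attach where the problem occurred: source, record, field, and ETL step.
        reporter.report(new ProvenanceIssue(source, row.get("id"), "value", STEP,
                "not a number: " + raw));
        return Optional.empty();
    }
}

class ProvenanceSketch {
    public static void main(String[] args) {
        InMemoryReporter reporter = new InMemoryReporter();
        ParseValueStep step = new ParseValueStep(reporter);
        step.apply("labs.csv", Map.of("id", "r1", "value", "4.2"));
        step.apply("labs.csv", Map.of("id", "r2", "value", "n/a"));
        reporter.issues().forEach(System.out::println);
    }
}
```

Keeping the reporter behind an interface is one possible design choice: the transform step stays unaware of whether issues are stored locally or sent to a service.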

Results

The framework was implemented in Java using the Spring Framework and integrated into ETL processes for Informatics for Integrating Biology and the Bedside (i2b2). The dashboard was realized using Grafana. We evaluated our approach on three different ETL processes that integrate real-world datasets into our i2b2 clinical data warehouse. Using the provenance dashboard, we were able to identify frequent error patterns and link them to specific data points from the sources as well as to ETL process steps. Provenance tracking increased the execution times of loading processes, with the impact depending on the number of identified issues.
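
As a rough illustration of what identifying frequent error patterns could look like on stored issues, the following Java sketch counts issues per ETL step and issue kind. It is an assumption-laden stand-in and not the dashboard described above, which was realized in Grafana: the `StoredIssue` record, its fields, the `IssuePatterns` class, and the sample data are hypothetical. Java 17 or later is assumed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stored issue; the fields are illustrative, not the paper's data model.
record StoredIssue(String source, String step, String kind) {}

class IssuePatterns {
    // Count issues per (step, kind) pair; a TreeMap keeps the output order stable.
    static Map<String, Long> countByStepAndKind(List<StoredIssue> issues) {
        return issues.stream().collect(Collectors.groupingBy(
                i -> i.step() + " / " + i.kind(),
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample data for demonstration only.
        List<StoredIssue> issues = List.of(
                new StoredIssue("labs.csv", "parse-lab-value", "not-a-number"),
                new StoredIssue("labs.csv", "parse-lab-value", "not-a-number"),
                new StoredIssue("visits.csv", "map-code", "unknown-code"));
        countByStepAndKind(issues)
                .forEach((pattern, count) -> System.out.println(count + "  " + pattern));
    }
}
```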

Conclusions

Provenance tracking can be a valuable tool for implementing continuous quality management for ETL processes. Relevant information can be collected from existing ETL workloads using dedicated APIs and visualized through dashboards, which support the identification of frequent patterns of problems together with their root causes, providing valuable information for improvements.
Source journal
International Journal of Medical Informatics (Medicine; Computer Science: Information Systems)
CiteScore: 8.90
Self-citation rate: 4.10%
Articles published: 217
Review time: 42 days
Journal description: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; Educational computer-based programs pertaining to medical informatics or medicine in general; Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.