TransformEHRs: a flexible methodology for building transparent ETL processes for EHR reuse.

IF 1.8, Zone 4 (Medicine), Q3 (Computer Science, Information Systems)

Methods of Information in Medicine Pub Date : 2022-12-01 DOI:10.1055/s-0042-1757763

Miguel Pedrera-Jiménez, Noelia García-Barrio, Paula Rubio-Mayo, Alberto Tato-Gómez, Juan Luis Cruz-Bermúdez, José Luis Bernal-Sobrino, Adolfo Muñoz-Carrero, Pablo Serrano-Balazote

Journal Article. Methods of Information in Medicine, volume 61 S 02, pages e89-e102. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/54/b2/10-1055-s-0042-1757763.PMC9788916.pdf

Citations: 2

Abstract

Background: During the COVID-19 pandemic, several methodologies were designed for obtaining electronic health record (EHR)-derived datasets for research. These processes are often black boxes, leaving clinical researchers unaware of how the data were recorded, extracted, and transformed. To solve this, extract, transform, and load (ETL) processes must be based on transparent, homogeneous, and formal methodologies that make them understandable, reproducible, and auditable.

Objectives: This study aims to design and implement a methodology, in accordance with the FAIR Principles, for building ETL processes (focused on data extraction, selection, and transformation) for EHR reuse in a transparent and flexible manner, applicable to any clinical condition and health care organization.

Methods: The proposed methodology comprises four stages: (1) analysis of secondary use models and identification of data operations, based on internationally used clinical repositories, case report forms, and aggregated datasets; (2) modeling and formalization of data operations, using the Detailed Clinical Models paradigm; (3) agnostic development of data operations, with SQL and R selected as programming languages; and (4) automation of the ETL instantiation, by building a formal configuration file in XML.
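Stage (4), instantiating an ETL process from a configuration file, can be illustrated with a minimal sketch. The abstract does not show the XML schema or the operation catalog, so the element names (`etl`, `operation`), the attributes, and the two operations below are hypothetical stand-ins. Python is used here only for illustration; the study itself selected SQL and R.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element and attribute names are illustrative
# assumptions, not the schema defined in the paper.
CONFIG_XML = """
<etl name="demo">
  <operation name="filter_equals" field="diagnosis" value="COVID-19"/>
  <operation name="select_fields" fields="patient_id,age"/>
</etl>
"""


def filter_equals(rows, field, value):
    """Keep only the rows whose `field` equals `value` (compared as text)."""
    return [r for r in rows if str(r.get(field)) == value]


def select_fields(rows, fields):
    """Project each row onto the comma-separated list of `fields`."""
    keep = [f.strip() for f in fields.split(",")]
    return [{k: r[k] for k in keep if k in r} for r in rows]


# Catalog mapping operation names to implementations (hypothetical names).
CATALOG = {"filter_equals": filter_equals, "select_fields": select_fields}


def instantiate_etl(config_xml):
    """Build a pipeline function from the XML configuration."""
    root = ET.fromstring(config_xml)
    steps = []
    for op in root.findall("operation"):
        params = dict(op.attrib)
        name = params.pop("name")
        if name not in CATALOG:
            raise ValueError(f"unknown operation: {name}")
        steps.append((CATALOG[name], params))

    def run(rows):
        # Apply the configured operations in document order.
        for func, params in steps:
            rows = func(rows, **params)
        return rows

    return run


# Toy records standing in for extracted EHR data (fabricated for the demo).
records = [
    {"patient_id": 1, "age": 70, "diagnosis": "COVID-19"},
    {"patient_id": 2, "age": 55, "diagnosis": "Influenza"},
]
pipeline = instantiate_etl(CONFIG_XML)
result = pipeline(records)
print(result)
```

Because the sequence of operations lives in the configuration file rather than in code, the file itself documents what was done to the data, which is the transparency property the methodology is after.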

Results: First, four international projects were analyzed to identify the 17 operations necessary to obtain datasets from the EHR according to the specifications of these projects. Next, each data operation was formalized using the ISO 13606 reference model, specifying the valid data types for its arguments, inputs, and outputs, and their cardinality. Then, an agnostic catalog of data operations was developed in the previously selected data-oriented programming languages. Finally, an automated ETL instantiation process was built from a formally defined ETL configuration file.
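The idea of formalizing an operation by the data types and cardinality of its inputs and outputs can be sketched as follows. This is an illustrative assumption about what such a specification might look like, not the paper's model: the class names, the `date_difference` operation, and the type labels `date` and `quantity` are placeholders and are not ISO 13606 type names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Slot:
    """One input or output of a data operation."""
    name: str
    data_type: str                  # placeholder label, not an ISO 13606 name
    min_occurs: int = 1
    max_occurs: Optional[int] = 1   # None means unbounded

    def accepts(self, values):
        """Check cardinality and type label of (type_label, value) pairs."""
        n = len(values)
        if n < self.min_occurs:
            return False
        if self.max_occurs is not None and n > self.max_occurs:
            return False
        return all(label == self.data_type for label, _ in values)


@dataclass(frozen=True)
class OperationSpec:
    """Formal signature of an operation: named, typed, bounded slots."""
    name: str
    inputs: Tuple[Slot, ...]
    outputs: Tuple[Slot, ...]

    def validate_inputs(self, supplied):
        """Return the names of input slots whose supplied values are invalid."""
        return [
            slot.name
            for slot in self.inputs
            if not slot.accepts(supplied.get(slot.name, []))
        ]


# Hypothetical operation: exactly one start date and one end date in,
# exactly one quantity out.
date_difference = OperationSpec(
    name="date_difference",
    inputs=(Slot("start", "date"), Slot("end", "date")),
    outputs=(Slot("days", "quantity"),),
)

ok = date_difference.validate_inputs(
    {"start": [("date", "2020-03-01")], "end": [("date", "2020-03-15")]}
)
bad = date_difference.validate_inputs({"start": [("date", "2020-03-01")]})
print(ok, bad)
```

Checking a configuration against such signatures before execution is one way an ETL definition could be audited without running it.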

Conclusions: This study provides a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Moreover, the abstraction carried out in this study means that any previous EHR reuse methodology can incorporate these results.


Source journal

Methods of Information in Medicine (Medicine, Computer Science: Information Systems)

CiteScore: 3.70

Self-citation rate: 11.80%

Review time: 6-12 weeks

About the journal: Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing, and analyzing data, information, and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal issue.