Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.

Impact factor: 1.9 (JCR Q3, Medicine, Research & Experimental)

Interactive Journal of Medical Research, volume 12, e44310. Pub Date: 2023-09-21. DOI: 10.2196/44310

Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao



Abstract

With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields in the past 20 years. In the process of medical research, data are constantly generated, and large amounts of real-world data form a "data disaster." Effective data analysis and mining are based on data availability and high data quality. The premise of high data quality is the need to clean the data. Data cleaning is the process of detecting and correcting "dirty data," which is the basis of data analysis and management. Moreover, data cleaning is a common technology for improving data quality. However, the current literature on real-world research provides little guidance on how to efficiently and ethically set up and perform data cleaning. To address this issue, we proposed a data cleaning framework for real-world research, focusing on the 3 most common types of dirty data (duplicate, missing, and outlier data), and a normal workflow for data cleaning to serve as a reference for the application of such technologies in future studies. We also provided relevant suggestions for common problems in data cleaning.
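The three types of dirty data named in the abstract (duplicate, missing, and outlier data) can be illustrated with a minimal detection sketch. This is not the authors' framework: the record layout, the field names (`patient_id`, `visit`, `sbp`), and the choice of the 1.5 × IQR rule for outliers are all assumptions made for illustration, and the sketch only flags suspect records rather than correcting them.

```python
from statistics import quantiles


def find_duplicates(records, key_fields):
    """Return indices of records whose key fields repeat an earlier record."""
    seen = set()
    duplicate_indices = []
    for i, rec in enumerate(records):
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            duplicate_indices.append(i)
        else:
            seen.add(key)
    return duplicate_indices


def find_missing(records, fields):
    """Return (index, field) pairs where the value is None or an empty string."""
    return [
        (i, f)
        for i, rec in enumerate(records)
        for f in fields
        if rec.get(f) in (None, "")
    ]


def find_outliers_iqr(records, field, k=1.5):
    """Return indices whose numeric value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed, commonly used heuristic, not a clinical threshold.
    """
    values = [r[field] for r in records if isinstance(r.get(field), (int, float))]
    if len(values) < 4:
        return []  # too few observations to estimate quartiles
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [
        i
        for i, r in enumerate(records)
        if isinstance(r.get(field), (int, float)) and not low <= r[field] <= high
    ]


# Hypothetical records; "sbp" stands for systolic blood pressure in mmHg.
records = [
    {"patient_id": "P001", "visit": "2023-01-05", "sbp": 120},
    {"patient_id": "P002", "visit": "2023-01-05", "sbp": 118},
    {"patient_id": "P001", "visit": "2023-01-05", "sbp": 120},   # repeated entry
    {"patient_id": "P003", "visit": "2023-01-06", "sbp": None},  # missing value
    {"patient_id": "P004", "visit": "2023-01-07", "sbp": 125},
    {"patient_id": "P005", "visit": "2023-01-07", "sbp": 122},
    {"patient_id": "P006", "visit": "2023-01-08", "sbp": 1200},  # implausible value
]

print(find_duplicates(records, ["patient_id", "visit"]))
print(find_missing(records, ["patient_id", "visit", "sbp"]))
print(find_outliers_iqr(records, "sbp"))
```

Returning indices instead of deleting or imputing keeps the raw data intact, so each flagged record can be reviewed and the handling decision documented before any correction is applied.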


Source journal: Interactive Journal of Medical Research (Medicine, Research & Experimental)

Self-citation rate: 0.00%

Review time: 12 weeks