Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint.

IF 1.9 Q3 MEDICINE, RESEARCH & EXPERIMENTAL
Manping Guo, Yiming Wang, Qiaoning Yang, Rui Li, Yang Zhao, Chenfei Li, Mingbo Zhu, Yao Cui, Xin Jiang, Song Sheng, Qingna Li, Rui Gao
Journal: Interactive Journal of Medical Research, vol. 12, e44310
Published: 2023-09-21
Publication type: Journal Article
DOI: 10.2196/44310 (https://doi.org/10.2196/44310)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557005/pdf/

Abstract

With the rapid development of science, technology, and engineering, large amounts of data have been generated in many fields over the past 20 years. Medical research generates data constantly, and the resulting volume of real-world data amounts to a "data disaster." Effective data analysis and mining depend on data availability and high data quality, and high data quality in turn requires data cleaning. Data cleaning is the process of detecting and correcting "dirty data"; it is the basis of data analysis and management and a common technique for improving data quality. However, the current literature on real-world research provides little guidance on how to set up and perform data cleaning efficiently and ethically. To address this issue, we propose a data cleaning framework for real-world research that focuses on the 3 most common types of dirty data (duplicate, missing, and outlier data), together with a normal workflow for data cleaning, to serve as a reference for applying such techniques in future studies. We also provide suggestions for common problems in data cleaning.
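The three types of dirty data named in the abstract (duplicate, missing, and outlier data) can be illustrated with a minimal sketch. This is not the framework proposed in the article: the function `clean_records`, the field names `id` and `sbp`, the keep-first rule for duplicates, and the interquartile-range rule for outliers are all assumptions chosen for illustration, and the sample values are invented. The sketch flags problem values instead of correcting them, since the right correction depends on the study.

```python
from statistics import quantiles


def clean_records(records, key, field, iqr_factor=1.5):
    """Deduplicate records, then flag missing and outlying values.

    Hypothetical helper for illustration only; the rules used here
    (keep first duplicate, IQR fences) are assumptions, not the
    article's method.
    """
    # Step 1, duplicates: keep the first record seen for each key.
    seen, unique = set(), []
    for rec in records:
        if rec[key] in seen:
            continue
        seen.add(rec[key])
        unique.append(dict(rec))  # copy so the input is not mutated

    # Quartiles are computed from observed (non-missing) values only.
    observed = [r[field] for r in unique if r[field] is not None]
    q1, _, q3 = quantiles(observed, n=4)
    spread = q3 - q1
    low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread

    # Steps 2 and 3: flag missing values and values outside the fences.
    for rec in unique:
        value = rec[field]
        rec["missing"] = value is None
        rec["outlier"] = value is not None and not (low <= value <= high)
    return unique


# Invented sample: systolic blood pressure readings keyed by patient id.
RAW = [
    {"id": 1, "sbp": 118},
    {"id": 2, "sbp": 122},
    {"id": 1, "sbp": 118},   # duplicate of the first record
    {"id": 3, "sbp": None},  # missing value
    {"id": 4, "sbp": 120},
    {"id": 5, "sbp": 125},
    {"id": 6, "sbp": 400},   # implausible value
    {"id": 7, "sbp": 119},
]

cleaned = clean_records(RAW, key="id", field="sbp")
for rec in cleaned:
    print(rec)
```

Flagging keeps the original values available for review, so a decision to delete, impute, or retain a value can be made and documented separately.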

Source journal: Interactive Journal of Medical Research (Medicine, Research & Experimental)
Self-citation rate: 0.00%
Articles published: 45
Review time: 12 weeks