Development and Validation of ML-DQA - a Machine Learning Data Quality Assurance Framework for Healthcare

M. Sendak, Gaurav Sirdeshmukh, Timothy N. Ochoa, H. Premo, Linda Tang, Kira L. Niederhoffer, Sarah Reed, Kaivalya Deshpande, E. Sterrett, M. Bauer, L. Snyder, Afreen I. Shariff, D. Whellan, J. Riggio, D. Gaieski, Kristin M Corey, Megan Richards, M. Gao, M. Nichols, Bradley Heintze, William S Knechtle, W. Ratliff, S. Balu
{"title":"Development and Validation of ML-DQA - a Machine Learning Data Quality Assurance Framework for Healthcare","authors":"M. Sendak, Gaurav Sirdeshmukh, Timothy N. Ochoa, H. Premo, Linda Tang, Kira L. Niederhoffer, Sarah Reed, Kaivalya Deshpande, E. Sterrett, M. Bauer, L. Snyder, Afreen I. Shariff, D. Whellan, J. Riggio, D. Gaieski, Kristin M Corey, Megan Richards, M. Gao, M. Nichols, Bradley Heintze, William S Knechtle, W. Ratliff, S. Balu","doi":"10.48550/arXiv.2208.02670","DOIUrl":null,"url":null,"abstract":"The approaches by which the machine learning and clinical research communities utilize real world data (RWD), including data captured in the electronic health record (EHR), vary dramatically. While clinical researchers cautiously use RWD for clinical investigations, ML for healthcare teams consume public datasets with minimal scrutiny to develop new algorithms. This study bridges this gap by developing and validating ML-DQA, a data quality assurance framework grounded in RWD best practices. The ML-DQA framework is applied to five ML projects across two geographies, different medical conditions, and different cohorts. A total of 2,999 quality checks and 24 quality reports were generated on RWD gathered on 247,536 patients across the five projects. Five generalizable practices emerge: all projects used a similar method to group redundant data element representations; all projects used automated utilities to build diagnosis and medication data elements; all projects used a common library of rules-based transformations; all projects used a unified approach to assign data quality checks to data elements; and all projects used a similar approach to clinical adjudication. An average of 5.8 individuals, including clinicians, data scientists, and trainees, were involved in implementing ML-DQA for each project and an average of 23.4 data elements per project were either transformed or removed in response to ML-DQA. This study demonstrates the importance role of ML-DQA in healthcare projects and provides teams a framework to conduct these essential activities.","PeriodicalId":231229,"journal":{"name":"Machine Learning in Health Care","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning in Health Care","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2208.02670","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

The approaches by which the machine learning and clinical research communities utilize real world data (RWD), including data captured in the electronic health record (EHR), vary dramatically. While clinical researchers cautiously use RWD for clinical investigations, ML for healthcare teams consume public datasets with minimal scrutiny to develop new algorithms. This study bridges this gap by developing and validating ML-DQA, a data quality assurance framework grounded in RWD best practices. The ML-DQA framework is applied to five ML projects across two geographies, different medical conditions, and different cohorts. A total of 2,999 quality checks and 24 quality reports were generated on RWD gathered on 247,536 patients across the five projects. Five generalizable practices emerge: all projects used a similar method to group redundant data element representations; all projects used automated utilities to build diagnosis and medication data elements; all projects used a common library of rules-based transformations; all projects used a unified approach to assign data quality checks to data elements; and all projects used a similar approach to clinical adjudication. An average of 5.8 individuals, including clinicians, data scientists, and trainees, were involved in implementing ML-DQA for each project, and an average of 23.4 data elements per project were either transformed or removed in response to ML-DQA. This study demonstrates the important role of ML-DQA in healthcare projects and provides teams with a framework to conduct these essential activities.
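The fourth practice listed in the abstract, assigning rules-based data quality checks to individual data elements, can be pictured with a short sketch. The class names, rule thresholds, and toy values below are illustrative assumptions for this note, not the ML-DQA implementation described in the paper.

```python
# Minimal sketch, assuming a simple data model: each EHR-derived data element
# carries its own set of rules-based quality checks, and a report gives the
# pass rate of each check. Hypothetical names and thresholds throughout.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class DataElement:
    """A single EHR-derived variable with the quality checks assigned to it."""
    name: str
    checks: List[Tuple[str, Callable[[Any], bool]]] = field(default_factory=list)

    def add_check(self, label: str, rule: Callable[[Any], bool]) -> None:
        self.checks.append((label, rule))

    def run_checks(self, values: list) -> dict:
        """Return the fraction of non-missing values passing each check."""
        report = {}
        for label, rule in self.checks:
            passed = sum(1 for v in values if v is not None and rule(v))
            report[label] = passed / len(values) if values else 0.0
        return report


# Example: a plausibility range check and a completeness check on heart rate.
heart_rate = DataElement("heart_rate_bpm")
heart_rate.add_check("plausible_range", lambda v: 20 <= v <= 300)
heart_rate.add_check("non_missing", lambda v: True)  # counts non-None values only

toy_values = [72, 88, None, 410, 65]  # toy data; None marks a missing measurement
print(heart_rate.run_checks(toy_values))
# {'plausible_range': 0.6, 'non_missing': 0.8}
```

A value flagged by a check like `plausible_range` would then be a candidate for the kind of transformation or removal the abstract reports (an average of 23.4 data elements per project).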