A Data Element-Function Conceptual Model for Data Quality Checks.

James R Rogers, Tiffany J Callahan, Tian Kang, Alan Bauck, Ritu Khare, Jeffrey S Brown, Michael G Kahn, Chunhua Weng
EGEMS (Washington, DC), 7(1):17. Journal Article. Published 2019-04-23. DOI: 10.5334/egems.289 (https://doi.org/10.5334/egems.289). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6484368/pdf/
Citations: 4

Abstract

Introduction: Existing data quality (DQ) checks are represented in heterogeneous formats, making them difficult to compare, categorize, and index. This study contributes a data element-function conceptual model to facilitate the categorization and indexing of DQ checks, and explores the feasibility of using natural language processing (NLP) to acquire, at scale, knowledge of common data elements and functions from DQ check narratives.

Methods: The model defines a "data element", the primary focus of the check, and a "function", the qualitative or quantitative measure over a data element. We applied NLP techniques to extract both from 172 checks for Observational Health Data Sciences and Informatics (OHDSI) and 3,434 checks for Kaiser Permanente's Center for Effectiveness and Safety Research (CESR).
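As a minimal sketch of the model described above, a check can be represented as a pair of one data element and one function. The class name `DQCheck`, the helper `classify`, and both keyword tables are hypothetical illustrations; the keyword lookup is a simple stand-in and not the NLP pipeline used in the study.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword tables; the study's actual vocabularies are not
# given in the abstract.
ELEMENT_KEYWORDS = {
    "Person": ("person", "patient"),
    "Medication": ("medication", "drug"),
}
FUNCTION_KEYWORDS = {
    "Count": ("count", "number of"),
    "Missing": ("missing", "null"),
}


@dataclass(frozen=True)
class DQCheck:
    """A DQ check reduced to its data element and its function."""
    data_element: str
    function: str


def _first_match(text: str, table: dict) -> Optional[str]:
    """Return the first label whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for label, keywords in table.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def classify(narrative: str) -> Optional[DQCheck]:
    """Map a free-text check narrative to a DQCheck, or None if unmatched."""
    element = _first_match(narrative, ELEMENT_KEYWORDS)
    function = _first_match(narrative, FUNCTION_KEYWORDS)
    if element is None or function is None:
        return None
    return DQCheck(element, function)
```

Substring matching of this kind is deliberately naive (for example, "encounter" contains "count"), which is one reason a real extraction step would need more than keyword lookup.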

Results: The model classified all checks. A total of 751 unique data elements and 24 unique functions were extracted. The five most frequent data element-function pairings for OHDSI were Person-Count (55 checks), Insurance-Distribution (17), Medication-Count (16), Condition-Count (14), and Observations-Count (13); for CESR, they were Medication-Variable Type (175), Medication-Missing (172), Medication-Existence (152), Medication-Count (127), and Socioeconomic Factors-Variable Type (114).
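A ranking of pairings like the one above can be produced by tallying (data element, function) pairs. The sketch below assumes each check has already been reduced to such a pair; the function name `top_pairings` and the toy input are invented for illustration and are not data from the study.

```python
from collections import Counter


def top_pairings(pairs, n=5):
    """Return the n most frequent (data element, function) pairs with counts."""
    return Counter(pairs).most_common(n)


# Invented toy input, not data from the study.
toy_pairs = [
    ("Person", "Count"),
    ("Medication", "Missing"),
    ("Person", "Count"),
]

print(top_pairings(toy_pairs, n=2))
# prints [(('Person', 'Count'), 2), (('Medication', 'Missing'), 1)]
```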

Conclusions: This study shows the efficacy of the data element-function conceptual model for classifying DQ checks, demonstrates the early promise of NLP-assisted knowledge acquisition, and reveals substantial heterogeneity in the focus of DQ checks, confirming variation between intrinsic checks and use-case-specific "fitness-for-use" checks.
