A Data Element-Function Conceptual Model for Data Quality Checks.

EGEMS (Washington, DC) Pub Date : 2019-04-23 DOI:10.5334/egems.289

James R Rogers, Tiffany J Callahan, Tian Kang, Alan Bauck, Ritu Khare, Jeffrey S Brown, Michael G Kahn, Chunhua Weng


Citations: 4

Abstract


Introduction: In aggregate, existing data quality (DQ) checks are currently represented in heterogeneous formats, making it difficult to compare, categorize, and index checks. This study contributes a data element-function conceptual model to facilitate the categorization and indexing of DQ checks and explores the feasibility of leveraging natural language processing (NLP) for scalable acquisition of knowledge of common data elements and functions from DQ check narratives.

Methods: The model defines a "data element", the primary focus of the check, and a "function", the qualitative or quantitative measure over a data element. We applied NLP techniques to extract both from 172 checks for Observational Health Data Sciences and Informatics (OHDSI) and 3,434 checks for Kaiser Permanente's Center for Effectiveness and Safety Research (CESR).

Results: The model was able to classify all checks. A total of 751 unique data elements and 24 unique functions were extracted. The five most frequent data element-function pairings for OHDSI were Person-Count (55 checks), Insurance-Distribution (17), Medication-Count (16), Condition-Count (14), and Observations-Count (13); for CESR, they were Medication-Variable Type (175), Medication-Missing (172), Medication-Existence (152), Medication-Count (127), and Socioeconomic Factors-Variable Type (114).

Conclusions: This study shows the efficacy of the data element-function conceptual model for classifying DQ checks, demonstrates early promise of NLP-assisted knowledge acquisition, and reveals the great heterogeneity in the focus in DQ checks, confirming variation in intrinsic checks and use-case specific "fitness-for-use" checks.
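
As a rough illustration of the indexing idea in the Methods and Results, the sketch below stores each check as a (data element, function) pair and tallies how often each pairing occurs. It is not the authors' NLP pipeline: the `DQCheck` class, the `pairing_counts` helper, and the check narratives are all hypothetical, and the pairs are assigned by hand rather than extracted from text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DQCheck:
    """One data quality check, indexed by the model's two components."""
    narrative: str     # free-text description of the check (invented examples)
    data_element: str  # primary focus of the check, e.g. "Medication"
    function: str      # qualitative or quantitative measure, e.g. "Count"


def pairing_counts(checks):
    """Tally how often each (data element, function) pairing occurs."""
    return Counter((c.data_element, c.function) for c in checks)


# Hypothetical checks, labeled manually for illustration only.
checks = [
    DQCheck("Number of persons", "Person", "Count"),
    DQCheck("Number of persons by year of birth", "Person", "Count"),
    DQCheck("Distribution of payer plan period length", "Insurance", "Distribution"),
    DQCheck("Drug exposure records with missing dose", "Medication", "Missing"),
]

counts = pairing_counts(checks)
for (element, function), n in counts.most_common():
    print(f"{element}-{function}: {n}")
```

Keeping the two components as separate fields means checks written in different formats can be compared on the same key, which is the kind of categorization the abstract describes.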
