Consistency as a Data Quality Measure for German Corona Consensus Items Mapped from National Pandemic Cohort Network Data Collections
Khalid O Yusuf, Olga Miljukov, Anne Schoneberg, Sabine Hanß, Martin Wiesenfeldt, Melanie Stecher, Lazar Mitrov, Sina Marie Hopff, Sarah Steinbrecher, Florian Kurth, Thomas Bahmer, Stefan Schreiber, Daniel Pape, Anna-Lena Hofmann, Mirjam Kohls, Stefan Störk, Hans Christian Stubbe, Johannes J Tebbe, Johannes C Hellmuth, Johanna Erber, Lilian Krist, Siegbert Rieg, Lisa Pilgram, Jörg J Vehreschild, Jens-Peter Reese, Dagmar Krefting
Methods of Information in Medicine, vol. 62 S 01, pp. e47-e56. Published 2023-06-01. DOI: 10.1055/a-2006-1086 (https://doi.org/10.1055/a-2006-1086)
Citations: 2
Abstract
Background: As a national effort to better understand the current pandemic, three cohorts collect sociodemographic and clinical data from coronavirus disease 2019 (COVID-19) patients from different target populations within the German National Pandemic Cohort Network (NAPKON). Furthermore, the German Corona Consensus Dataset (GECCO) was introduced as a harmonized basic information model for COVID-19 patients in clinical routine. To compare the cohort data with other GECCO-based studies, data items are mapped to GECCO. As mapping from one information model to another is complex, an additional consistency evaluation of the mapped items is recommended to detect possible mapping issues or source data inconsistencies.
Objectives: The goal of this work is to assure high consistency of research data mapped to the GECCO data model. In particular, it aims to identify contradictions within interdependent GECCO data items of the German national COVID-19 cohorts so that possible reasons for the identified contradictions can be investigated. We furthermore aim to enable other researchers to easily perform data quality evaluation on GECCO-based datasets and to adapt the evaluation to similar data models.
Methods: All suitable data items from each of the three NAPKON cohorts are mapped to the GECCO items. A consistency assessment tool (dqGecco) is implemented, following the design of an existing quality assessment framework and retaining its defined consistency taxonomies, which include logical and empirical contradictions. Results of the assessment are verified independently on the primary data source.
Results: Our consistency assessment tool helped correct the mapping procedure and revealed remaining contradictory value combinations within COVID-19 symptoms, vital signs, and COVID-19 severity. Consistency rates differ between indicators and cohorts, ranging from 95.84% to 100%.
Conclusion: An efficient and portable tool capable of discovering inconsistencies in the COVID-19 domain has been developed and applied to three different cohorts. As the GECCO dataset is employed in different platforms and studies, the tool can be directly applied there or adapted to similar information models.
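To make the idea of a contradiction check between interdependent items more concrete, the following is a minimal Python sketch. It is not the dqGecco implementation and does not use its API. The item names (`fever_reported`, `body_temperature_c`), the 38.5 °C threshold, and the sample records are hypothetical and chosen only for illustration. The sketch counts records in which two interdependent items contradict each other and reports the share of non-contradictory records as a consistency rate, assuming that records with a missing item are excluded from the evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A record maps (hypothetical) item names to values; None marks a missing value.
Record = Dict[str, Optional[object]]


@dataclass(frozen=True)
class ContradictionRule:
    """One check between interdependent items (illustrative, not dqGecco's API)."""
    name: str
    items: Tuple[str, ...]                # items that must be present to evaluate
    contradicts: Callable[[Record], bool]  # True if the value combination is contradictory


def consistency_rate(records: List[Record], rule: ContradictionRule) -> Optional[float]:
    """Return the percentage of evaluable records without a contradiction.

    Records missing any item named in the rule are skipped (an assumption of
    this sketch). Returns None when no record can be evaluated.
    """
    evaluable = [
        r for r in records
        if all(r.get(item) is not None for item in rule.items)
    ]
    if not evaluable:
        return None
    contradictions = sum(1 for r in evaluable if rule.contradicts(r))
    return 100.0 * (1 - contradictions / len(evaluable))


# Hypothetical rule: "no fever" reported together with a high measured temperature.
fever_rule = ContradictionRule(
    name="fever symptom vs. body temperature",
    items=("fever_reported", "body_temperature_c"),
    contradicts=lambda r: (not r["fever_reported"]) and r["body_temperature_c"] >= 38.5,
)

# Made-up sample records, not study data.
records: List[Record] = [
    {"fever_reported": False, "body_temperature_c": 36.8},
    {"fever_reported": True, "body_temperature_c": 39.1},
    {"fever_reported": False, "body_temperature_c": 39.4},  # contradictory
    {"fever_reported": True, "body_temperature_c": None},   # skipped: missing value
    {"fever_reported": False, "body_temperature_c": 37.0},
]

rate = consistency_rate(records, fever_rule)
print(f"{fever_rule.name}: {rate:.2f}%")  # prints "fever symptom vs. body temperature: 75.00%"
```

Keeping each rule as a small data object with its own predicate means that adapting the check to a similar information model would mainly involve renaming items and rewriting predicates, which is one plausible way to read the portability goal stated above.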
About the journal:
Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal issue.