Eric A. Goldberg MS, Connor J. Ross BS, Vivian Paraskevi Douglas MD, DVM, Alexander Ivanov MS, Tobias Elze PhD, Joan W. Miller MD, Alice C. Lorch MD, MPH
Data Duplication and Errors in Large Medical Data Sets: A Case Study in the IRIS® Registry
Ophthalmology Science, volume 6, issue 1, Article 100933. Published 2025-09-02.
DOI: 10.1016/j.xops.2025.100933
https://www.sciencedirect.com/science/article/pii/S2666914525002313
Abstract
Purpose
To investigate entry errors and data duplication within the American Academy of Ophthalmology IRIS® Registry (Intelligent Research in Sight) using records of cataract surgery (CS), neodymium-doped yttrium aluminum garnet (YAG) capsulotomy, age-related macular degeneration (AMD), and diabetic retinopathy (DR).
Design
Retrospective cohort study.
Participants
Patients in the IRIS Registry.
Methods
We collected records of CS and YAG capsulotomy with specified laterality within the IRIS Registry (years 2013–2023), identifying eyes having >1 record and eyes having ≥1 record on a date after the first entry (different date duplication, Dd). Additionally, among records of DR and AMD, we identified eyes with transition errors, defined as either (1) an initial diagnosis at a more severe stage followed by reversion to a less severe stage or (2) a transition to a more severe stage followed later by a diagnosis at the less severe stage. We investigated potential predictors of Dd and transition errors among patient and practice characteristics by evaluating the permutation feature importance (PFI) of classification models.
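The duplication categories described above can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (an eye identifier paired with a procedure date) and the names `classify_duplication` and `records` are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date


def classify_duplication(records):
    """Label each eye as 'single', 'same_date_only', or 'different_date' (Dd).

    `records` is an iterable of (eye_id, procedure_date) pairs; this layout
    is hypothetical and not taken from the IRIS Registry schema.
    """
    dates_by_eye = defaultdict(list)
    for eye_id, procedure_date in records:
        dates_by_eye[eye_id].append(procedure_date)

    labels = {}
    for eye_id, dates in dates_by_eye.items():
        first = min(dates)
        if len(dates) == 1:
            labels[eye_id] = "single"
        elif any(d > first for d in dates):
            # At least one record falls on a date after the first entry.
            labels[eye_id] = "different_date"
        else:
            # More than one record, all on the initial procedure date.
            labels[eye_id] = "same_date_only"
    return labels


# Toy records, invented for this example.
records = [
    ("eye-A", date(2015, 3, 1)),
    ("eye-B", date(2016, 7, 9)),
    ("eye-B", date(2016, 7, 9)),
    ("eye-C", date(2017, 1, 5)),
    ("eye-C", date(2018, 2, 2)),
]
labels = classify_duplication(records)
```

In this toy input, eye-B has two records on the same date and eye-C has a second record on a later date, so only eye-C would count toward Dd under this reading.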
Main Outcome Measures
For CS and YAG capsulotomy, we measured the proportion of eyes having >1 procedure record, having >1 record only on the initial procedure date, and having ≥1 procedure record on a date after the first entry. For DR and AMD, we measured the proportion of eyes reverting to an earlier stage after starting at a later stage and the proportion reverting to an earlier stage after transitioning to a later stage.
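One possible reading of the two DR and AMD measures is sketched below in Python. The integer severity codes (higher meaning more severe), the function name `transition_errors`, and the exact rule for deciding when an eye has "transitioned" are assumptions; the abstract does not specify how stages were encoded or ordered.

```python
def transition_errors(stages):
    """Return (starts_high_then_reverts, reverts_after_progression).

    `stages` lists hypothetical severity codes for one eye in chronological
    order, with larger numbers meaning a more severe stage.
    """
    if not stages:
        return (False, False)

    # Measure 1: some later diagnosis is less severe than the first one.
    starts_high_then_reverts = any(s < stages[0] for s in stages[1:])

    # Measure 2: the eye moves to a more severe stage at some point and
    # is later recorded at a less severe stage than the preceding record.
    progressed = False
    reverts_after_progression = False
    for prev, cur in zip(stages, stages[1:]):
        if cur > prev:
            progressed = True
        elif cur < prev and progressed:
            reverts_after_progression = True

    return (starts_high_then_reverts, reverts_after_progression)


# Toy sequences, invented for this example.
example_start_high = transition_errors([2, 1])
example_progress_then_revert = transition_errors([1, 2, 1])
example_no_error = transition_errors([1, 1, 2])
```

Under these assumptions an eye can satisfy both measures at once, for example a sequence that starts at stage 2, rises to stage 3, and then drops to stage 1.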
Results
Of the 14 718 896 CS-treated eyes, 30.9% had duplicates, with 5.5% having Dd. For YAG capsulotomy, out of 5 113 679 eyes, 29.1% had duplicates, with 4.1% having Dd. For AMD and DR, 13.6% and 12.7% of eyes, respectively, exhibited transition errors. Models captured a relationship between the eye’s first practice on record and the data errors under study, indicated by an average PFI F1-loss of 0.230 for the Dd model and 0.062 for the transition error model.
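The F1-loss figures can be read as the average drop in a model's F1 score when one feature's values are shuffled. The following Python sketch shows that general technique on toy data. It is an illustration under stated assumptions, not the study's pipeline: the rule-based `predict` function, the two toy features, and the parameters `n_repeats` and `seed` are all hypothetical, and the abstract does not say which classifiers or settings were used.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = error present); 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def permutation_f1_loss(predict, rows, y_true, n_repeats=5, seed=0):
    """Mean drop in F1 per feature column after shuffling that column."""
    rng = random.Random(seed)
    baseline = f1_score(y_true, [predict(r) for r in rows])
    losses = []
    for col in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column's value replaced.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            total += baseline - f1_score(y_true, [predict(r) for r in permuted])
        losses.append(total / n_repeats)
    return losses


# Toy data: column 0 is a hypothetical "first practice" flag that fully
# determines the label; column 1 is unrelated to the label.
rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]
y_true = [r[0] for r in rows]


def predict(row):
    """Stand-in classifier that looks only at column 0."""
    return row[0]


losses = permutation_f1_loss(predict, rows, y_true)
```

Because the stand-in classifier ignores column 1, shuffling it leaves F1 unchanged and its loss is zero, while any loss for column 0 reflects how much the predictions depend on that feature.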
Conclusions
Data duplication in large medical data sets necessitates caution when analyzing repeated procedures or relapsing conditions. Addressing problematic errors requires transparency and communication amongst stakeholders across organizations. Within the IRIS Registry, the results indicated an association between the first record’s originating practice and data errors, providing an investigative entry point for upstream data stewards.
Financial Disclosure(s)
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.