Estimating prevalence of rare genetic disease diagnoses using electronic health records in a children's hospital
Kate Herr, Peixin Lu, Kessi Diamreyan, Huan Xu, Eneida Mendonca, K Nicole Weaver, Jing Chen
HGG Advances. Journal Article. Published 2024-10-10 (Epub 2024-08-14).
DOI: https://doi.org/10.1016/j.xhgg.2024.100341
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11401171/pdf/
Abstract
Rare genetic diseases (RGDs) affect a significant number of individuals, particularly in pediatric populations. This study investigates how effectively RGD diagnoses can be identified through electronic health records (EHRs) and natural language processing (NLP) tools, and analyzes the prevalence of the identified RGDs for potential underdiagnosis at Cincinnati Children's Hospital Medical Center (CCHMC). EHR data from 659,139 pediatric patients at CCHMC were used. Diagnoses corresponding to RGDs in Orphanet were identified using rule-based and machine-learning-based NLP methods. Manual evaluation assessed the precision of the NLP strategies, with 100 diagnosis descriptions reviewed for each method. The rule-based method achieved a precision of 97.5% (95% CI: 91.5%, 99.4%), while the machine-learning-based method had a precision of 73.5% (95% CI: 63.6%, 81.6%). A manual chart review of 70 randomly selected patients with RGD diagnoses confirmed the diagnoses in 90.3% (95% CI: 82.0%, 95.2%) of cases. Using the rule-based method, a total of 37,326 pediatric patients were identified with 977 RGD diagnoses, resulting in a prevalence of 5.66% in this population. While a majority of the disorders showed a higher prevalence at CCHMC than in Orphanet, some diseases, such as 1p36 deletion syndrome, indicated potential underdiagnosis. Analyses further uncovered disparities in RGD prevalence and age of diagnosis across gender and racial groups. This study demonstrates the utility of combining EHR data with NLP tools to systematically investigate RGD diagnoses in large cohorts. The identified disparities underscore the need for improved approaches to ensure timely and accurate diagnosis and management of pediatric RGDs.
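The reported overall prevalence can be checked directly from the two counts quoted in the abstract (37,326 identified patients out of 659,139 in the cohort). The short Python sketch below is only an arithmetic check; the function name `prevalence_percent` is a hypothetical helper introduced here for illustration, not something taken from the study.

```python
def prevalence_percent(cases: int, population: int) -> float:
    """Return `cases` as a percentage of `population` (hypothetical helper)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * cases / population


# Counts quoted in the abstract.
patients_with_rgd = 37_326
cohort_size = 659_139

prevalence = prevalence_percent(patients_with_rgd, cohort_size)
print(f"{prevalence:.2f}%")  # prints "5.66%"
```

The result agrees with the 5.66% stated in the abstract when rounded to two decimal places.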