Development of machine learning-based mpox surveillance models in a learning health system
Harry Reyes Nieva, Jason Zucker, Emma Tucker, Jacob McLean, Clare DeLaurentis, Shauna Gunaratne, Noémie Elhadad
Journal: Sexually Transmitted Infections (impact factor 2.9; JCR Q2, Infectious Diseases)
Publication type: Journal Article
Publication date: 2025-05-02
DOI: https://doi.org/10.1136/sextrans-2024-056382
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12353557/pdf/
Citations: 0
Abstract
Objectives: This study aimed to develop robust machine learning (ML)-based and deep learning (DL)-based models capable of detecting mpox cases for surveillance efforts using clinical notes.
Methods: As part of a learning health system initiative, we conducted a retrospective study of clinical encounters at the Columbia University Irving Medical Center in New York City. We included patients with mpox diagnoses confirmed by PCR testing between 15 May 2022 and 15 October 2022 and three matched controls for each case based on patient age, sex, race, ethnicity and visit month. We trained three mpox surveillance models using: (1) logistic regression with L1 regularisation (least absolute shrinkage and selection operator (LASSO)), (2) ClinicalBERT and (3) ClinicalLongformer. We evaluated model performance using precision, recall, F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC) and recall at 80% precision (RP80).
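RP80, the last metric listed in the Methods, can be read as the highest recall reachable at any decision threshold whose precision is at least 80%. The following is a minimal sketch under that assumed definition; the function name `recall_at_precision` and the labels and scores are invented for illustration and are not taken from the study.

```python
def recall_at_precision(y_true, scores, min_precision=0.80):
    """Highest recall over all score thresholds whose precision is >= min_precision.

    y_true: iterable of 0/1 labels (1 = confirmed case).
    scores: iterable of model scores, higher = more likely a case.
    Returns 0.0 if no threshold reaches the required precision.
    """
    y_true = list(y_true)
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank notes from most to least suspicious.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume tied scores together so a threshold never splits equal scores.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision:
            best_recall = max(best_recall, recall)
        i = j
    return best_recall


# Invented example: 4 cases and 4 controls with hypothetical model scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.10]
print(recall_at_precision(labels, scores))  # 0.5
```

In this toy ranking, flagging the top two notes gives precision 1.0 and recall 0.5; admitting the next two drops precision to 0.75, so recall at 80% precision stays at 0.5.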
Results: The study included 228 PCR-confirmed mpox cases and 698 controls. LASSO regression outperformed the DL models with a precision, recall and F1 score of 0.93, AUROC of 0.97, AUPRC of 0.93 and RP80 of 0.89. ClinicalBERT achieved a precision of 0.88, recall of 0.89, F1 score of 0.88 and AUROC of 0.93. ClinicalLongformer achieved a precision of 0.87, recall of 0.88, F1 score of 0.87 and AUROC of 0.92. Phrases related to symptoms (eg, lesions and pain) were among the most predictive features in LASSO regression.
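To illustrate how an L1-regularised logistic regression can surface predictive phrases such as "lesions", here is a toy sketch in pure Python. It is not the authors' pipeline: the notes are synthetic, the tokeniser is a whitespace split, and the solver is plain proximal gradient descent (soft-thresholding) rather than whatever library the study used. All function names and hyperparameters are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(note):
    return note.lower().split()


def vectorize(note, index):
    """Bag-of-words counts over a fixed vocabulary index."""
    x = [0.0] * len(index)
    for term, count in Counter(tokenize(note)).items():
        if term in index:
            x[index[term]] = float(count)
    return x


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=500):
    """Minimise mean log-loss + lam * ||w||_1 (intercept left unpenalised)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            grad_b += err / n
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj / n
        b -= lr * grad_b
        # Gradient step followed by soft-thresholding gives sparse weights.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def top_features(weights, vocab):
    """Terms with positive weight, strongest first."""
    pairs = [(t, w) for t, w in zip(vocab, weights) if w > 0]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Synthetic notes (invented): 1 = case-like, 0 = control-like.
notes = [
    ("painful lesions on trunk", 1),
    ("rash with lesions and pain", 1),
    ("vesicular lesions fever", 1),
    ("cough and fever", 0),
    ("sore throat cough", 0),
    ("routine visit no complaints", 0),
]
vocab = sorted({t for text, _ in notes for t in tokenize(text)})
index = {t: i for i, t in enumerate(vocab)}
X = [vectorize(text, index) for text, _ in notes]
y = [label for _, label in notes]

weights, bias = fit_l1_logistic(X, y)
print(top_features(weights, vocab)[:3])
```

The L1 penalty drives most weights to or near zero, which is what makes the surviving terms easy to inspect. A real analysis would use far more notes, a proper tokeniser and held-out evaluation.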
Conclusions: ML and DL models based on clinical notes show promise for identifying mpox cases. In this study, LASSO regression outperformed DL models and excelled in minimising false positives. These findings highlight the potential for ML and DL methods to support case surveillance for mpox and other infectious diseases. These methods may also prove helpful for flagging missed or delayed diagnoses as part of continuous quality improvement.
About the journal:
Sexually Transmitted Infections is the world’s longest-running international journal on sexual health. It aims to keep practitioners, trainees and researchers up to date in the prevention, diagnosis and treatment of all STIs and HIV. The journal publishes original research, descriptive epidemiology, evidence-based reviews and comment on the clinical, public health, sociological and laboratory aspects of sexual health from around the world. We also publish educational articles, letters and other material of interest to readers, along with podcasts and other online material. STI provides a high-quality editorial service from submission to publication.