Lorenzo Argante, Germain Lonnet, Emmanuel Aris, Jane Whelan
DIGITAL HEALTH, volume 11, article 20552076251331895. Published 2025-04-03. DOI: https://doi.org/10.1177/20552076251331895. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11970062/pdf/
Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea.
Background: Gonorrhea is a sexually transmitted infection (STI) that, untreated, can result in debilitating complications such as pelvic inflammatory disease, pain, and infertility. A minority of cases are diagnosed in STI clinics in the United States. Gonorrhea is often asymptomatic and presumed to be substantially underdiagnosed and/or undertreated.
Objectives: To generate and compare predictive machine learning (ML) models using administrative claims data to characterize young women in the general United States population who would be most likely to contract gonorrhea.
Methods: Data were extracted from the Merative™ MarketScan® Commercial and Medicaid databases containing routinely collected administrative claims data. Women aged 16-35 years with two years of continuous observation between 1 January 2017 and 31 December 2018 were included. ML classification models were constructed based on logistic regression and tree-based algorithms.
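The inclusion criteria above can be illustrated with a minimal sketch. It is not the authors' extraction code: the `Member` record layout, the use of birth year to approximate age at the start of the study window, and the rule that enrollment spans must cover every day of the window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Study window taken from the Methods text.
STUDY_START = date(2017, 1, 1)
STUDY_END = date(2018, 12, 31)


@dataclass
class Member:
    # Hypothetical record layout; real claims enrollment tables differ.
    sex: str
    birth_year: int
    enrollment: list  # list of (start_date, end_date) tuples, both inclusive


def continuously_observed(spans, start=STUDY_START, end=STUDY_END):
    """Return True if the inclusive spans cover every day from start to end."""
    covered_until = start - timedelta(days=1)
    for span_start, span_end in sorted(spans):
        if span_start > covered_until + timedelta(days=1):
            break  # a gap in enrollment before the window is fully covered
        covered_until = max(covered_until, span_end)
        if covered_until >= end:
            return True
    return covered_until >= end


def eligible(member, min_age=16, max_age=35):
    """Apply the sex, age, and continuous-observation criteria."""
    # Assumption: age is approximated at the start of the study window.
    age = STUDY_START.year - member.birth_year
    return (
        member.sex == "F"
        and min_age <= age <= max_age
        and continuously_observed(member.enrollment)
    )
```

In this sketch, two back-to-back enrollment spans count as continuous, while a gap of even one day excludes the member. The study itself may have allowed short gaps, which the abstract does not specify.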
Results: Models constructed using tree-based algorithms such as XGBoost provided the best discrimination, but simpler ridge regression models with splines also achieved reasonable discrimination, allowing for the identification of population subsets at increased risk of gonorrhea infection. A subset of 0.1% of the population identified by the XGBoost model had a 70-fold higher risk of gonorrhea than the general population. In external validation, the models were applied to a Medicaid dataset that was not used in developing the original models, and their performance was deemed acceptable.
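A fold-increase in risk for a top-scoring subset, such as the one reported above, is commonly computed as the outcome rate in that subset divided by the outcome rate in the whole population. The sketch below shows that calculation under stated assumptions: the function name `top_fraction_lift` is hypothetical, the scores can come from any model, and the abstract does not confirm that the authors computed their figure in exactly this way.

```python
def top_fraction_lift(scores, outcomes, fraction):
    """Outcome rate among the highest-scoring `fraction` of individuals,
    divided by the outcome rate in the whole population."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when the population has no cases")
    # Keep at least one individual in the top subset.
    n_top = max(1, int(len(scores) * fraction))
    # Rank individuals from highest to lowest predicted risk.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / overall_rate
```

As a worked example on synthetic numbers (not the study's data): in a population of 1,000 with 10 cases, the overall rate is 1%. If 7 of those cases fall among the 10 highest-scoring individuals, the rate in that top 1% is 70%, which gives a lift of 70.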
Conclusions: The models and methods presented here could facilitate the identification of women at high risk of contracting gonorrhea for whom targeted preventive measures may be most beneficial.