Beyond the ‘black box’: choosing interpretable machine learning models for predicting postoperative opioid trends

IF 7.5 · Q1 ANESTHESIOLOGY · Medicine, Tier 1
Anaesthesia · Pub Date: 2025-02-02 · DOI: 10.1111/anae.16553
Seshadri C. Mudumbai, James Baurley, Caitlin E. Coombes, Randall S. Stafford, Edward R. Mariano

Abstract

Artificial intelligence encompasses machine learning and is a popular, yet controversial, topic in healthcare. Recent guidelines from national regulatory agencies underscore the critical importance of interpretability in machine learning models used in healthcare [1]. ‘Interpretability’ means that clinicians understand the reasoning behind a model's predictions, fostering trust and enabling informed clinical decision-making [Doshi-Velez et al. preprint, https://arxiv.org/abs/1702.08608]. In response to the opioid epidemic, there has been interest in using machine learning models to predict which patients will have the highest risk of postoperative opioid dependence. For a model to be interpretable, clinicians should be able to see which specific factors (e.g. previous opioid use, type of surgery or mental health conditions) contribute to its predictions. Experts have advocated for building inherently interpretable models from the start, especially in high-stakes medical contexts, rather than retrofitting explanations onto complex models after development [2]. As machine learning algorithms become integral to peri-operative management, balancing model complexity with interpretability is crucial [3]. The objective of this study was to evaluate whether simpler, more interpretable models could match complex ones in predictive accuracy and in identifying key predictors for postoperative opioid use.

Following institutional review board approval, we conducted a retrospective cohort study at a US Veterans Affairs hospital. We included adult patients who had surgery from 2015 to 2021 and had documented pre-operative and post-discharge opioid prescriptions. Patients without complete opioid prescription data were excluded.

Baseline data were extracted from electronic health records and included: patient characteristics; clinical variables (such as type of surgery and duration of hospital stay); and mental health diagnoses. We assessed three outcomes, with mean daily morphine milligram equivalents (MME) as the primary outcome and variance in MME and monthly rate of change in MME as secondary outcomes; these were all measured over 12 months before surgery and post-discharge. Opioid prescriptions were converted to MME, and mental health diagnoses were identified using International Classification of Diseases, 10th revision (ICD-10) codes as described in previous studies [4].
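The three outcomes above can be computed directly from a patient's monthly MME series. The sketch below is illustrative only (the study's analyses were done in R, and the input series is hypothetical): mean MME, variance in MME, and monthly rate of change taken as the ordinary least-squares slope against month index.

```python
# Illustrative sketch, not the authors' code: the three study outcomes
# (mean, variance, monthly rate of change) from 12 monthly MME totals.
# The interpretation of "rate of change" as an OLS slope is an assumption.
def mme_outcomes(monthly_mme):
    n = len(monthly_mme)
    mean_mme = sum(monthly_mme) / n
    variance = sum((x - mean_mme) ** 2 for x in monthly_mme) / n
    # Monthly rate of change: least-squares slope of MME against month index.
    months = range(n)
    mean_m = sum(months) / n
    cov = sum((m - mean_m) * (x - mean_mme) for m, x in zip(months, monthly_mme))
    var_m = sum((m - mean_m) ** 2 for m in months)
    slope = cov / var_m
    return mean_mme, variance, slope

# A flat 12-month series: non-zero mean, zero variance, zero slope.
print(mme_outcomes([90.0] * 12))  # (90.0, 0.0, 0.0)
```

A rising series (e.g. escalating prescriptions) would instead return a positive slope, which is what the monthly-rate-of-change outcome is meant to capture.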

We developed three machine learning models to predict postoperative opioid use: lasso regression, which enhances accuracy and interpretability through variable selection and regularisation; decision tree, which predicts outcomes using interpretable decision rules inferred from data; and extreme gradient boosting (XGBoost), an ensemble method known for high predictive performance but lower interpretability [5].
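Lasso's interpretability comes from its tendency to set weak coefficients exactly to zero. A minimal sketch of that mechanism: in the special case of standardised, uncorrelated predictors, the lasso solution is soft-thresholding of the ordinary least-squares coefficients. The coefficient values and penalty below are hypothetical, chosen only to show the selection effect.

```python
# Why lasso is interpretable: soft-thresholding zeroes out weak predictors.
# (Exact for orthonormal designs; real fits use coordinate descent.)
def soft_threshold(beta_ols, lam):
    """Lasso coefficient given an OLS coefficient and penalty lam."""
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0  # weak effects are removed entirely

# Hypothetical OLS coefficients for three predictors.
ols = {"pre_op_mme": 0.9, "surgery_type": 0.4, "noise_feature": 0.05}
lasso = {k: soft_threshold(b, lam=0.1) for k, b in ols.items()}
print(lasso)  # noise_feature is shrunk to exactly 0.0
```

The surviving non-zero coefficients form a short, readable list of predictors, which is the property the letter contrasts with XGBoost's ensemble of hundreds of trees.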

Analyses were performed using RStudio (version 12.0, R Foundation for Statistical Computing, Vienna, Austria) involving two scenarios: models were trained using only baseline predictors without pre-operative opioid use data; and models included all baseline predictors plus pre-operative opioid use metrics. We utilised the interpretable machine learning package for feature importance analysis, with the rpart and XGBoost packages used for model implementation. Hyperparameters were optimised via grid search and cross-validation. Ten-fold cross-validation minimised overfitting and assessed generalisability. The primary evaluation metric was root mean squared error (RMSE) and mean absolute error (MAE) was also calculated. Feature importance was determined by coefficient magnitude (lasso regression), tree structure (decision tree) and built-in importance measures (XGBoost), with p < 0.05 defined as statistically significant.
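The evaluation loop described above can be sketched as follows. This is a hedged Python analogue of the R workflow, using a trivial mean predictor as a stand-in model; it shows only how 10-fold cross-validated RMSE and MAE are pooled, not the study's actual models.

```python
# Hedged sketch of 10-fold cross-validation with RMSE and MAE.
# The "model" here is a stand-in that predicts the training-set mean.
import math
import random

def kfold_rmse_mae(y, k=10, seed=0):
    """Return pooled (RMSE, MAE) over k held-out folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    sq_errs, abs_errs = [], []
    for f in range(k):
        test = set(idx[f::k])                      # every k-th shuffled index
        train = [y[i] for i in idx if i not in test]
        pred = sum(train) / len(train)             # stand-in model: train mean
        for i in test:
            sq_errs.append((y[i] - pred) ** 2)
            abs_errs.append(abs(y[i] - pred))
    rmse = math.sqrt(sum(sq_errs) / len(sq_errs))
    mae = sum(abs_errs) / len(abs_errs)
    return rmse, mae
```

By construction RMSE is never below MAE, and the gap between the two grows with large outlying errors, which is why the letter reports both for a heavily skewed outcome such as MME.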

The study cohort consisted of 1396 patients who were predominantly male (93.6%), aged > 70 y (49.4%) and White (77.4%) (online Supporting Information Table S1). Half of the cohort had a diagnosed mental illness, with major depression (58.2%) and substance use disorder (27.9%) being most prevalent. Surgery type varied, with orthopaedics (19.3%) and ophthalmology (9.3%) being common. The mean (SD) pre-operative MME was 681 (1340), indicating significant opioid use before surgery.

Including pre-operative opioid metrics enhanced predictive accuracy across all models (Table 1). Lasso regression showed the greatest improvement (RMSE 1263 to 711, MAE 726 to 350, p < 0.01), followed by decision tree (RMSE 1286 to 787, MAE 709 to 363, p < 0.01), while XGBoost showed modest improvements (RMSE 1352 to 1168, MAE 600 to 528, p < 0.05). For secondary outcomes (online Supporting Information Table S2), models showed modest improvements in predicting opioid use variance, with XGBoost performing best (RMSE 2,540,888 to 2,299,404), while improvements in predicting monthly rate of change were minimal.

Table 1. Comparison of machine learning model performance in predicting post-discharge mean opioid use with and without pre-operative opioid data. Models were trained on two sets of predictors: baseline predictors only (e.g. demographics, surgery type, duration of hospital stay and mental health diagnoses); and baseline plus pre-operative opioid metrics (12-month pre-operative opioid usage). Values are mean MME prediction errors; lower values indicate better predictive performance.

Model              Metric   Baseline predictors   + Pre-operative opioid metrics
Lasso regression   RMSE     1263                  711
                   MAE      726                   350
Decision tree      RMSE     1286                  787
                   MAE      709                   363
XGBoost            RMSE     1352                  1168
                   MAE      600                   528

MME, morphine milligram equivalent; RMSE, root mean squared error; MAE, mean absolute error.

Feature importance analysis revealed differences among models (online Supporting Information Figure S1). While XGBoost heavily weighted pre-operative mean MME, emphasising reliance on previous opioid use patterns, decision tree and lasso regression identified additional important predictors. Decision tree highlighted surgical type and duration of hospital stay alongside pre-operative opioid metrics, while lasso regression emphasised mental health diagnoses and duration of hospital stay as influential predictors.
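For the lasso model, 'coefficient magnitude' importance reduces to ranking predictors by the absolute value of their fitted coefficients. The sketch below is illustrative only; the coefficient values are hypothetical and are not the study's results.

```python
# Illustrative only: deriving a feature-importance ranking from lasso
# coefficients by absolute magnitude. All values below are hypothetical.
coefs = {
    "pre_op_mean_mme": 0.82,
    "mental_health_dx": -0.41,   # sign shows direction; magnitude shows importance
    "length_of_stay": 0.33,
    "surgery_type_ortho": 0.12,
    "age_over_70": 0.0,          # zeroed out by the lasso penalty
}
importance = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, c in importance:
    print(f"{name:20s} |coef| = {abs(c):.2f}")
```

Note this ranking is only meaningful when predictors are on a common (standardised) scale; otherwise coefficient magnitude conflates effect size with units.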

Our study shows that simpler models can predict postoperative opioid trends effectively and provide valuable insights into key predictors [6]. Notably, lasso regression and decision tree models identified clinically relevant factors beyond opioid use history, achieving comparable accuracy while offering greater potential interpretability. The lack of transparency in complex models may limit clinical adoption and needs further evaluation.

The Veterans Affairs healthcare population is known to have higher prevalence rates of mental illness, pre-operative opioid use and substance use disorders compared with typical surgical populations in the USA. Recent studies of general surgical populations in the USA report pre-operative opioid use rates of 10–30% and mental health diagnosis rates of 10–35%, compared with rates in our cohort of 78% and > 50%, respectively [6-8]. The 94.1% prevalence of chronic pain in our cohort is also notably higher than in general surgical populations (typically 25–40%). These characteristics of US veterans, particularly among those seeking surgical care at VA healthcare facilities, are important to note [4, 7]. While these features may limit the generalisability of any inferences, from the perspective of our study purpose, higher prevalence rates contribute to a richer dataset for evaluating and comparing the ability of these models to identify complex predictor relationships. Based on our results, prioritising simpler models and interpretability may enhance clinical utility without compromising performance [9]. Multicentre evaluation involving more diverse surgical populations will be necessary to validate these findings and assess model interpretability needs across different clinical settings.
