Title: Beyond predictive accuracy: Statistical validation of feature importance in biomedical machine learning
Authors: Souichi Oka, Nobuko Inoue, Yoshiyasu Takefuji
Journal: Computer Methods and Programs in Biomedicine, volume 272, Article 109085
Publication date: 2025-09-24
DOI: 10.1016/j.cmpb.2025.109085
URL: https://www.sciencedirect.com/science/article/pii/S0169260725005024
Impact factor: 4.8 (JCR Q1, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
In medical machine learning (ML), optimizing a model for predictive performance is methodologically distinct from pursuing causal inference for mechanistic interpretation. High predictive accuracy does not necessarily imply that a model can uncover the true physiological mechanisms underlying the data. This letter addresses a critical interpretational challenge in medical machine learning, building upon Yuyang Yan et al.’s valuable work on exacerbation classification in asthma and COPD. While their multi-feature fusion model, which comprises models such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), and Bidirectional Long Short-Term Memory (BiLSTM), demonstrates high predictive accuracy for respiratory exacerbations, we highlight that such performance alone does not guarantee reliable insights into feature importance. Complex tree-based models like RF, when interpreted via methods like SHapley Additive exPlanations (SHAP), can exhibit inherent biases, overemphasizing features used in early splits and reflecting what is important for their specific prediction rather than the true underlying physiological drivers. Validating feature importance remains challenging without ground truth, as different models often yield different rankings. We argue that relying solely on model-dependent interpretations risks misrepresenting the actual mechanisms of complex medical phenomena. We therefore advocate a robust analytical strategy that goes beyond predictive metrics: combining the predictive power of ML with impartial, complementary statistical methodologies, such as non-parametric correlation and mutual information, to ensure trustworthy scientific insights into the true drivers of respiratory exacerbations.
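As a rough illustration of the model-independent checks the abstract names, the sketch below computes Spearman's rank correlation (a non-parametric correlation) and a binned mutual information estimate between each feature and a binary outcome. It is not the authors' procedure. The synthetic data, the feature names (`signal`, `u_shaped`, `noise`), the labelling rule, and the choice of four equal-frequency bins are all assumptions made for the example. It uses only the standard library; in practice one would more likely call `scipy.stats.spearmanr` and `sklearn.feature_selection.mutual_info_classif`.

```python
import math
import random
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def mutual_information(x, y, bins=4):
    """Plug-in MI estimate (in nats) between a continuous feature x,
    discretized into equal-frequency bins, and a discrete label y."""
    n = len(x)
    xb = [min(int((r - 1) * bins / n), bins - 1) for r in average_ranks(x)]
    joint, px, py = Counter(zip(xb, y)), Counter(xb), Counter(y)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Synthetic data (assumption): three hypothetical features, one binary label.
rng = random.Random(0)
n = 2000
features = {
    "signal": [rng.gauss(0, 1) for _ in range(n)],    # monotone effect
    "u_shaped": [rng.gauss(0, 1) for _ in range(n)],  # non-monotone effect
    "noise": [rng.gauss(0, 1) for _ in range(n)],     # no effect
}
label = [
    1 if s + u * u > 1 else 0
    for s, u in zip(features["signal"], features["u_shaped"])
]

# Map each feature name to (Spearman rho, mutual information in nats).
scores = {
    name: (spearman(col, label), mutual_information(col, label))
    for name, col in features.items()
}
for name, (rho, mi) in scores.items():
    print(f"{name:>9}: spearman={rho:+.3f}  mutual_info={mi:.3f}")
```

In this toy setup the two measures complement each other. Rank correlation picks up the monotone `signal` feature but is close to zero for `u_shaped`, whose effect is symmetric around zero, while mutual information flags both. Rankings from such statistics can then be set beside model-derived importances (for example SHAP values from an RF) to see where they agree and where they diverge.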
About the journal:
The journal's aims are: to encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; to support the eventual distribution of demonstrable software so as to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; and to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.