Interpretable machine learning approach for predicting COVID-19 risk status of an individual

A. Onoja, M. Ejiwale, Ayesan Rewane
Journal: Transactions on Networks and Communications
Publication date: 2021-04-04
DOI: 10.14738/TNC.92.9760 (https://doi.org/10.14738/TNC.92.9760)
Citations: 0

Abstract

This study aimed to identify, using statistical feature selection methods and interpretable machine learning models, the features that best predict an individual's risk status ("Low", "Medium", "High") for COVID-19 infection. It uses a publicly available dataset obtained via an online web-based risk assessment calculator for COVID-19 infection. The 59 features were first filtered for multicollinearity using the Pearson correlation coefficient, leaving 57, and then shrunk to 55 features by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The best classifier's predictions were saved as a new instance and perturbed using a single decision tree classifier. To further build trust in and explainability of the best model, an XGBoost classifier was trained on the best model's predictions as a global surrogate model. Individual-level explanations of the XGBoost model were produced using the SHAP explainable-AI framework. The Random Forest classifier, with a validation accuracy of 96.35% on the 55 features retained by feature selection, emerged as the best classifier. The decision tree classifier approximated the best classifier with a prediction accuracy of 92.23% and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier with a prediction score of 99.7%. The study identified COVID-19 positive, COVID-19 contacts, COVID-19 symptoms, Health workers, and Public transport count as the five most consistent features that predict an individual's risk of exposure to COVID-19.
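The abstract reports how closely each surrogate model reproduces the best classifier's predictions, using accuracy and the Matthews correlation coefficient. The sketch below shows one way such agreement scores can be computed for three-class labels. It is an illustration only, not the authors' code: the function names and the toy label lists are hypothetical, and the multiclass coefficient uses the standard generalized form computed from label counts.

```python
from collections import Counter
from math import sqrt


def fidelity_accuracy(reference, surrogate):
    """Fraction of samples where the surrogate reproduces the reference label."""
    if len(reference) != len(surrogate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(r == s for r, s in zip(reference, surrogate))
    return matches / len(reference)


def multiclass_mcc(reference, surrogate):
    """Multiclass Matthews correlation coefficient from label counts.

    MCC = (c*n - sum_k t_k*p_k) / sqrt((n^2 - sum_k p_k^2) * (n^2 - sum_k t_k^2))
    where c is the number of matching labels, n the sample count,
    t_k the reference count of class k and p_k the surrogate count of class k.
    """
    if len(reference) != len(surrogate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    correct = sum(r == s for r, s in zip(reference, surrogate))
    true_counts = Counter(reference)
    pred_counts = Counter(surrogate)
    labels = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = sqrt(
        (n * n - sum(v * v for v in pred_counts.values()))
        * (n * n - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0.0 when either side uses a single class only.
    return 0.0 if denom == 0 else (correct * n - cross) / denom


# Hypothetical labels: predictions of the best model (treated as ground truth
# for the surrogate) and the labels a surrogate assigns to the same samples.
best_model_labels = ["Low", "Low", "Medium", "High", "High", "Medium", "Low", "High"]
surrogate_labels = ["Low", "Low", "Medium", "High", "Medium", "Medium", "Low", "High"]

print(f"fidelity accuracy: {fidelity_accuracy(best_model_labels, surrogate_labels):.3f}")
print(f"fidelity MCC:      {multiclass_mcc(best_model_labels, surrogate_labels):.3f}")
```

In a surrogate setting the reference labels are the best model's predictions rather than the true risk status, so both scores measure how faithfully the surrogate mimics that model, not how well it predicts COVID-19 risk.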