A comprehensive diagnostic framework for hepatitis C using structured data and predictive analytics

Healthcare analytics (New York, N.Y.) | Pub Date: 2025-08-20 | DOI: 10.1016/j.health.2025.100412

Behnaz Motamedi, Balázs Villányi

Journal Article, Volume 8, Article 100412. URL: https://www.sciencedirect.com/science/article/pii/S2772442525000310

Citations: 0

Abstract



This study tests the hypothesis that a structured preprocessing and feature selection methodology can substantially improve the classification accuracy and generalizability of machine learning (ML) models that predict hepatitis C virus (HCV) stages from clinical and demographic data. HCV infection is a chronic liver disease that progresses through several stages, so accurate and timely staging is needed to guide therapy. Although ML offers opportunities for stage prediction, class imbalance, missing data, and feature redundancy limit model efficacy and generalizability. To test the hypothesis, we built a four-phase preprocessing pipeline: Baseline imputes missing values using class-specific means; Refine mitigates outliers through class-specific medians and normalization; Balanced addresses class imbalance across five stages using localized random affine shadow-sampling; and Augmented adds a clustering-based feature derived from an ensemble of K-means and Gaussian mixture models, combined with principal component analysis. The prediction model combines ReliefF feature selection with a random forest classifier tuned by random search. The resulting model attained an accuracy of 0.9983, precision of 0.9984, recall of 0.9983, F1-score of 0.9984, and Matthews correlation coefficient (MCC) of 0.9979 on the training set, and an accuracy of 0.9977, precision of 0.9976, recall of 0.9981, F1-score of 0.9978, and MCC of 0.9973 on the independent test set. The ensemble clustering component showed reasonable validity, with an adjusted Rand index of 1.0, a moderate silhouette coefficient of 0.4702, and a Davies–Bouldin score of 1.1745, modestly outperforming the individual clustering methods. The findings support the hypothesis and indicate that thorough preprocessing, stringent feature selection, and model optimization yield a highly accurate and generalizable tool for predicting HCV stages, thereby improving clinical diagnosis and treatment strategies.
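As a rough illustration of the Baseline phase only, class-specific mean imputation can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the record layout (a list of dicts with `None` for missing values), the function name `impute_class_means`, and the feature names `alt` and `ast` are all hypothetical, and the abstract does not say how the authors handle a class with no observed values for a feature.

```python
from collections import defaultdict
from statistics import mean


def impute_class_means(rows, label_key, feature_keys):
    """Return copies of rows in which each missing (None) feature is replaced
    by the mean of that feature among rows that share the same label."""
    # Collect the observed (non-missing) values per (label, feature) pair.
    observed = defaultdict(list)
    for row in rows:
        for key in feature_keys:
            if row[key] is not None:
                observed[(row[label_key], key)].append(row[key])

    # Mean of each feature within each class.
    class_means = {group: mean(values) for group, values in observed.items()}

    imputed = []
    for row in rows:
        filled = dict(row)  # copy, so the input records are not modified
        for key in feature_keys:
            if filled[key] is None:
                # Assumption: stays None when the class has no observed value.
                filled[key] = class_means.get((row[label_key], key))
        imputed.append(filled)
    return imputed


# Hypothetical toy records; "alt" and "ast" are illustrative feature names.
records = [
    {"stage": 1, "alt": 30.0, "ast": 25.0},
    {"stage": 1, "alt": None, "ast": 35.0},
    {"stage": 1, "alt": 50.0, "ast": None},
    {"stage": 2, "alt": 80.0, "ast": 90.0},
    {"stage": 2, "alt": None, "ast": 70.0},
]
filled_records = impute_class_means(records, "stage", ["alt", "ast"])
```

Because the fill value depends on the stage label, a scheme like this only applies where labels are known. In practice the class means would presumably be computed on the training split alone, so that no information leaks from the test set.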


Source journal: Healthcare analytics (New York, N.Y.)

Subject areas: Applied Mathematics, Modelling and Simulation, Nursing and Health Professions (General)

CiteScore: 4.40

Self-citation rate: 0.00%

Review time: 79 days