A comprehensive diagnostic framework for hepatitis C using structured data and predictive analytics
Behnaz Motamedi, Balázs Villányi
Healthcare analytics (New York, N.Y.), volume 8, Article 100412. Published 2025-08-20.
DOI: 10.1016/j.health.2025.100412
https://www.sciencedirect.com/science/article/pii/S2772442525000310
This study hypothesizes that a structured preprocessing and feature selection methodology can substantially improve the classification accuracy and generalizability of machine learning (ML) models that predict stages of hepatitis C virus (HCV) infection from clinical and demographic data. HCV infection is a chronic liver disease that progresses through several stages, so accurate and timely staging is needed for optimal therapy. Although ML presents opportunities for stage prediction, class imbalance, missing data, and feature redundancy limit model efficacy and generalizability. To test this hypothesis, we established a four-phase preprocessing pipeline: Baseline imputes missing values using class-specific means; Refine mitigates outliers through class-specific medians and normalization; Balanced addresses class imbalance across five stages using localized random affine shadow-sampling; and Augmented adds a clustering-based feature derived from an ensemble of K-means and Gaussian mixture models, combined with principal component analysis. The prediction model was built by selecting features with ReliefF and training a random forest classifier tuned with random search. The resulting model attained an accuracy of 0.9983, precision of 0.9984, recall of 0.9983, F1-score of 0.9984, and Matthews correlation coefficient (MCC) of 0.9979 on the training set. On the independent test set, it achieved an accuracy of 0.9977, precision of 0.9976, recall of 0.9981, F1-score of 0.9978, and MCC of 0.9973. The ensemble clustering component showed reasonable validity, with an adjusted Rand index of 1.0, a moderate silhouette coefficient of 0.4702, and a Davies–Bouldin score of 1.1745, modestly outperforming the individual clustering methods. The findings support the hypothesis and indicate that thorough preprocessing, stringent feature selection, and model optimization yield a highly accurate and generalizable tool for predicting HCV stages, which may improve clinical diagnosis and treatment strategies.
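
The Baseline phase named in the abstract (imputing missing values with class-specific means) can be illustrated with a small sketch. This is plain Python under stated assumptions, not the authors' implementation: the function name `impute_class_means`, the toy feature rows, and the stage labels `"stage_1"` and `"stage_2"` are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean


def impute_class_means(rows, labels):
    """Fill None entries with the mean of that feature within the row's class.

    rows   : list of equal-length lists of floats, with None marking a missing value
    labels : list of class labels, one per row
    Raises KeyError if a class has no observed value at all for some feature.
    """
    # Gather the observed (non-missing) values per (class, feature index).
    observed = defaultdict(list)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                observed[(label, j)].append(value)

    # One mean per (class, feature index) pair.
    class_means = {key: mean(values) for key, values in observed.items()}

    # Build new rows; the input is left unmodified.
    return [
        [class_means[(label, j)] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


# Hypothetical toy data: two features, two classes, two missing entries.
toy_rows = [[1.0, None], [3.0, 4.0], [None, 10.0], [8.0, 20.0]]
toy_labels = ["stage_1", "stage_1", "stage_2", "stage_2"]
toy_imputed = impute_class_means(toy_rows, toy_labels)
```

In this toy case the missing second feature of the first row is filled from the only other `"stage_1"` row, and the missing first feature of the third row is filled from the only other `"stage_2"` row. Because the fill value depends on the class label, a sketch like this applies only where labels are available.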