Imputation methods on retrospective breast cancer data in Tanzania: A comparative study

Women health care and issues. Pub Date: 2022-06-06. DOI: 10.31579/2642-9756/118

Rahibu A. Abassi, Amina S. Msengwa, Rocky R. J. Akarro


Abstract

Background: Clinical datasets are prone to missing data for several reasons, including patients' failure to attend clinical measurements and defects of the measurement recorder. Missing data can significantly affect an analysis, and results may be doubtful because of the bias introduced by omitting incomplete records, especially when the dataset is small. This study compares several imputation methods in terms of their efficiency in filling in missing data, so as to increase prediction and classification accuracy in a breast cancer dataset.

Methodology: Several imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expectation maximisation via bootstrapping, and multiple imputation by chained equations, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using the root mean square error (RMSE) and the mean absolute error (MAE) in order to obtain a suitable complete dataset. Binary logistic regression and linear discriminant analysis classifiers were then applied to the imputed dataset to compare their efficacy in classification and discrimination.

Results: The evaluation of the imputation methods showed that predictive mean matching performed better than the other methods. In addition, binary logistic regression and linear discriminant analysis yielded very similar overall classification rates, sensitivity and specificity.

Conclusion: Predictive mean matching showed higher accuracy in estimating and replacing missing data values in the real breast cancer dataset under study, and it is an effective approach to handling missing data. We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables. It improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.
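The abstract does not give the evaluation procedure in detail, so the following is only a minimal sketch of how RMSE and MAE can be used to compare imputation methods: known values are hidden at random, each method fills them in, and the errors are computed on the hidden entries alone. Everything here is an assumption made for illustration, not the authors' code: the synthetic data, the function names (`mean_impute`, `pmm_impute`, `compare_methods`), the single predictor, and the donor count `k=5`. The predictive mean matching shown is a simplified single-imputation variant that omits the posterior draw of regression parameters used in full implementations.

```python
import math
import random


def rmse(actual, imputed):
    """Root mean square error between held-out true values and imputed values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual))


def mae(actual, imputed):
    """Mean absolute error between held-out true values and imputed values."""
    return sum(abs(a - b) for a, b in zip(actual, imputed)) / len(actual)


def mean_impute(y):
    """Series-mean imputation: fill each missing entry (None) with the observed mean."""
    observed = [v for v in y if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in y]


def pmm_impute(x, y, k=5, rng=random):
    """Simplified predictive mean matching with one predictor x (illustrative only).

    Fits least squares y ~ x on the complete cases, then fills each missing y
    with the observed y of a donor drawn at random from the k complete cases
    whose predicted values are closest to the prediction for the missing case.
    """
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = sum(xi for xi, _ in donors) / len(donors)
    y_bar = sum(yi for _, yi in donors) / len(donors)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in donors)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in donors)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    def predict(xi):
        return intercept + slope * xi

    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        target = predict(xi)
        nearest = sorted(donors, key=lambda d: abs(predict(d[0]) - target))[:k]
        filled.append(rng.choice(nearest)[1])  # borrow an observed value
    return filled


def compare_methods(n=200, missing_rate=0.2, seed=0):
    """Hide a random share of a synthetic variable and score each method on it."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y_true = [2.0 * xi + 1.0 + rng.gauss(0, 1) for xi in x]  # synthetic, not clinical data
    hidden = sorted(rng.sample(range(n), int(n * missing_rate)))
    hidden_set = set(hidden)
    y_obs = [None if i in hidden_set else v for i, v in enumerate(y_true)]

    actual = [y_true[i] for i in hidden]
    candidates = [
        ("series mean", mean_impute(y_obs)),
        ("PMM (simplified)", pmm_impute(x, y_obs, rng=rng)),
    ]
    results = {}
    for name, filled in candidates:
        imputed = [filled[i] for i in hidden]
        results[name] = {"RMSE": rmse(actual, imputed), "MAE": mae(actual, imputed)}
    return results


results = compare_methods()
for name, scores in results.items():
    print(f"{name}: RMSE={scores['RMSE']:.3f}, MAE={scores['MAE']:.3f}")
```

Lower RMSE and MAE on the hidden entries indicate imputed values closer to the truth. On this strongly linear synthetic example the regression-based method should score lower than the series mean, but that says nothing about how the methods rank on the study's clinical data.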

