Imputation methods on retrospective breast cancer data in Tanzania: A comparative study

Rahibu A. Abassi, Amina S. Msengwa, Rocky R. J. Akarro
Journal: Women health care and issues
Publication date: 2022-06-06
DOI: 10.31579/2642-9756/118 (https://doi.org/10.31579/2642-9756/118)

Abstract

Background: Clinical datasets are prone to missing data for several reasons, including patients' failure to attend clinical measurements and defects in measurement recorders. Missing data can significantly affect an analysis, and results may be doubtful because of the bias introduced when incomplete records are omitted, especially if the dataset is small. This study compares several imputation methods in terms of their efficiency in filling in missing data, with the aim of increasing prediction and classification accuracy in a breast cancer dataset.

Methodology: Imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expectation maximisation via bootstrapping, and multiple imputation by chained equations, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using the root mean square error (RMSE) and mean absolute error (MAE) in order to obtain a suitable complete dataset. Binary logistic regression and linear discriminant analysis classifiers were then applied to the imputed dataset to compare their efficacy in classification and discrimination.

Results: The evaluation showed that predictive mean matching performed better than the other imputation methods. In addition, binary logistic regression and linear discriminant analysis yielded very similar overall classification rates, sensitivity and specificity.

Conclusion: Predictive mean matching showed higher accuracy than the other methods compared in estimating and replacing missing values in the real breast cancer dataset under study. It is an effective approach to handling missing data. We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables. It improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.
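The comparison described in the methodology can be illustrated with a minimal Python sketch. It is not the authors' code or data: the dataset is synthetic, the missingness pattern (every fourth value hidden) is an assumption made so that the true values are known, and `pmm_impute` is a hypothetical, simplified predictive mean matching with a single fully observed predictor. It omits the parameter draws used in full multiple imputation. The sketch hides known values, fills them in with series mean and with the simplified matching step, and scores each method by RMSE and MAE against the hidden truth.

```python
import math
import random


def rmse(true_vals, imputed_vals):
    """Root mean square error between held-out true values and imputed values."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n)


def mae(true_vals, imputed_vals):
    """Mean absolute error between held-out true values and imputed values."""
    n = len(true_vals)
    return sum(abs(t - p) for t, p in zip(true_vals, imputed_vals)) / n


def mean_impute(y):
    """Series mean imputation: replace every missing value with the observed mean."""
    observed = [v for v in y if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in y]


def pmm_impute(x, y, k=3, rng=None):
    """Simplified predictive mean matching (hypothetical helper, single predictor).

    Fits a least-squares line on the observed cases, then for each missing case
    picks one of the k observed donors whose predicted value is closest and
    borrows that donor's observed value.
    """
    if rng is None:
        rng = random.Random(0)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mean_x = sum(xi for xi, _ in obs) / n
    mean_y = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs) / sxx
    intercept = mean_y - slope * mean_x
    # Each donor is (predicted value, observed value).
    donors = [(intercept + slope * xi, yi) for xi, yi in obs]
    imputed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            imputed.append(yi)
            continue
        pred = intercept + slope * xi
        nearest = sorted(donors, key=lambda d: abs(d[0] - pred))[:k]
        imputed.append(rng.choice(nearest)[1])
    return imputed


# Synthetic data (assumption): y depends linearly on x plus Gaussian noise.
rng = random.Random(42)
x = [float(i) for i in range(40)]
y_full = [2.0 * xi + 5.0 + rng.gauss(0, 1.0) for xi in x]

# Hide every fourth value so the truth is known for scoring (assumed pattern).
missing_idx = list(range(2, 40, 4))
y_miss = [None if i in missing_idx else v for i, v in enumerate(y_full)]
truth = [y_full[i] for i in missing_idx]

results = {}
candidates = [
    ("series mean", mean_impute(y_miss)),
    ("pmm (simplified)", pmm_impute(x, y_miss, k=3, rng=rng)),
]
for name, imputed in candidates:
    filled = [imputed[i] for i in missing_idx]
    results[name] = (rmse(truth, filled), mae(truth, filled))
    print(f"{name}: RMSE={results[name][0]:.3f}, MAE={results[name][1]:.3f}")
```

Lower RMSE and MAE indicate imputed values closer to the hidden truth. On real clinical data the true values are unknown, so an evaluation of this kind has to mask values that were actually observed, and the ranking of methods can depend on the missingness mechanism and on the variables involved.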