Missing Value Imputation for Remote Healthcare Data: A Case Study of Portable Health Clinic System

2021 9th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC). Pub Date: 2021-12-13. DOI: 10.1109/JAC-ECC54461.2021.9691308

Yosuke Imamura, N. Abedin, Lu Sixian, Shaira Tabassum, Ashir Ahmed



Abstract

This study investigates which method is best for imputing missing values in remote healthcare data sets. A missing value is an empty field in a health record. It can occur for three major reasons: (i) the parameter was not measured, (ii) it was measured but not recorded, or (iii) it was lost during communication. Our case study, the Portable Health Clinic (PHC) data set, was collected from multiple regions by different authorities at different times. PHC data also contain manual errors. Missing and erroneous data are problematic for data analysis and for making accurate predictions, so it is necessary to detect and eliminate erroneous data and to fill the empty fields. Missing value imputation methods are well established for numerical data, but PHC data contain both numerical and categorical fields, which makes them difficult to impute. We propose a new data processing mechanism whose output is fed into an existing machine learning algorithm. To test the idea, we used a complete PHC data set (numerical features only) without any missing values and generated missing values by randomly erasing part of it. We then applied several existing imputation methods and our proposed method to the same target data set and compared their performance. We found that the Mean Imputer, kNN, and MissForest were not effective. Iterative Imputer predicted best for 7 features and our method for 4. We therefore conclude that the effectiveness of imputation methods may vary with the specific data set and features. Our future work is to include the categorical data and monitor the performance.
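The evaluation protocol described in the abstract (start from a complete numerical data set, erase cells at random, impute, then compare methods feature by feature) can be sketched roughly as follows. This is an illustration only, not the authors' code: the data are synthetic, and the three feature columns, the 20% missing rate, the use of RMSE as the score, and all helper names are assumptions. Only two simple baselines are shown (column mean and a basic unscaled kNN); MissForest, Iterative Imputer, and the proposed method are not implemented here.

```python
import math
import random


def mask_at_random(rows, missing_rate, rng):
    """Return a copy of rows with roughly missing_rate of the cells set to None."""
    return [[None if rng.random() < missing_rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Fill each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=3):
    """Fill each missing cell with the average of that column over the k rows
    closest on the columns both rows have observed (no feature scaling)."""
    fallback = mean_impute(rows)  # used when no neighbour is comparable
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(row[c] - other[c]) ** 2 for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [val for _, val in candidates[:k]]
            out[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return out


def rmse_per_feature(truth, masked, imputed):
    """RMSE per column, computed only over the cells that were erased."""
    scores = []
    for j in range(len(truth[0])):
        sq = [(truth[i][j] - imputed[i][j]) ** 2
              for i in range(len(truth)) if masked[i][j] is None]
        scores.append(math.sqrt(sum(sq) / len(sq)) if sq else float("nan"))
    return scores


# Synthetic "complete" data: two correlated columns and one independent column.
rng = random.Random(0)
truth = []
for _ in range(200):
    base = rng.gauss(0, 1)
    truth.append([120 + 15 * base + rng.gauss(0, 3),
                  80 + 10 * base + rng.gauss(0, 3),
                  rng.gauss(36.6, 0.4)])

masked = mask_at_random(truth, 0.2, rng)
methods = {"mean": mean_impute, "knn": knn_impute}
results = {name: rmse_per_feature(truth, masked, fn(masked)) for name, fn in methods.items()}

# Report which method has the lowest error for each feature.
for j in range(len(truth[0])):
    best = min(results, key=lambda name: results[name][j])
    print(f"feature {j}: best={best}, rmse={results[best][j]:.3f}")
```

Scoring only the erased cells keeps the comparison focused on imputation quality, and reporting the winner per feature mirrors the abstract's observation that different methods can come out ahead on different features.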

