A novel ranked k-nearest neighbors algorithm for missing data imputation.

IF 1.1; Zone 4 (Mathematics); JCR Q2, STATISTICS & PROBABILITY

Journal of Applied Statistics Pub Date : 2024-10-11 eCollection Date: 2025-01-01 DOI:10.1080/02664763.2024.2414357

Yasir Khan, Said Farooq Shah, Syed Muhammad Asim

Journal of Applied Statistics, 52(5), pp. 1103-1127. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951327/pdf/

Citations: 0

Abstract

Missing data is a common problem in many domains that rely on data analysis. The k Nearest Neighbors imputation method has been widely used to address this issue, but it has limitations in accurately imputing missing values, especially for datasets with small pairwise correlations and small values of k. In this study, we propose Ranked k Nearest Neighbors imputation, a method that follows a similar approach to k Nearest Neighbors but uses the concept of ranked set sampling to select the most relevant neighbors for imputation. Our results show that the proposed method outperforms the standard k Nearest Neighbors method in imputation accuracy under both the Missing Completely at Random and Missing at Random mechanisms, as demonstrated by consistently lower MSIE and MAIE values across all datasets. This suggests that the proposed method is a promising alternative for imputing missing values in datasets with small pairwise correlations and small values of k. Thus, the proposed Ranked k Nearest Neighbors method has important implications for data imputation in various domains and can contribute to the development of more efficient and accurate imputation methods without adding computational complexity to the algorithm.
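
To make the baseline concrete, the following is a minimal sketch of standard k Nearest Neighbors imputation together with the two error measures named in the abstract. It is not the authors' code: the abstract does not describe the ranked selection step, so that step is not attempted here. The function names (knn_impute, imputation_errors), the distance over jointly observed coordinates, and the reading of MSIE and MAIE as mean squared and mean absolute imputation error are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donors' values.

    Hypothetical helper: a plain kNN baseline, not the ranked variant.
    """
    def distance(a, b):
        # Root mean squared difference over coordinates observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


def imputation_errors(truth, imputed, mask):
    """Return (MSIE, MAIE) over the masked cells (assumed definitions)."""
    diffs = [imputed[i][j] - truth[i][j] for i, j in mask]
    msie = sum(d * d for d in diffs) / len(diffs)
    maie = sum(abs(d) for d in diffs) / len(diffs)
    return msie, maie


# Toy data: hide one known cell, impute it, then score the result.
truth = [[1.0, 2.0], [1.1, 2.2], [0.9, 1.8], [5.0, 10.0]]
mask = [(0, 1)]
observed = [list(r) for r in truth]
for i, j in mask:
    observed[i][j] = None

filled = knn_impute(observed, k=2)
msie, maie = imputation_errors(truth, filled, mask)
```

Hiding known cells and scoring the imputed values against them is one common way to compare imputation methods under a chosen missingness mechanism; a ranked variant would change only how the donors are selected before averaging.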


Source journal

Journal of Applied Statistics (Mathematics: Statistics & Probability)

CiteScore: 3.40
Self-citation rate: 0.00%
Articles published: 126
Review time: 6 months

About the journal: Journal of Applied Statistics provides a forum for communication between applied statisticians and users of applied statistical techniques across a wide range of disciplines. These areas include business, computing, economics, ecology, education, management, medicine, operational research and sociology, but papers from other areas are also considered. The editorial policy is to publish rigorous but clear and accessible papers on applied techniques. Purely theoretical papers are avoided, but those on theoretical developments that clearly demonstrate significant applied potential are welcomed. Each paper is submitted to at least two independent referees.