Information-Content-Informed Kendall-tau Correlation Methodology: Interpreting Missing Values as Useful Information.

bioRxiv: the preprint server for biology. Pub Date: 2025-07-21. DOI: 10.1101/2022.02.24.481854

Robert M Flight, Praneeth S Bhatt, Hunter NB Moseley



Abstract

Background: Almost all currently available correlation measures cannot handle missing values directly. Typically, missing values are either removed before the calculation or imputed and then used in computing the correlation coefficient. Both approaches change the correlation value while treating the missing data as carrying no useful information. However, missing values occur in real data sets for a variety of reasons. In omics data sets derived from analytical measurements, a major reason is that a specific measurable phenomenon falls below the detection limit of the analytical instrumentation (left-censored values). Such values are not missing at random; their "missingness" at one end of the data distribution is itself useful information.

Results: To include the information carried by left-censored missing values, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion can be interpreted as adding information to the correlation. We also implement calculations of two additional measures, theoretical maxima and pairwise completeness, that add further layers of information interpretation to the methodology. Using simulated data and real data sets from RNA-seq, metabolomics, and lipidomics experiments, we demonstrate that the ICI-Kt methodology allows left-censored missing values to be included as interpretable information, improving both the determination of outlier samples and feature-feature network construction. We provide explicitly parallel implementations in both R and Python that allow fast calculation of all the variables used when applying the ICI-Kt methodology to large numbers of samples.
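The core idea in the Results paragraph can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the ICIKendallTau or icikt packages. It assumes that missing values are encoded as NaN, that every missing value is left-censored, and that each one can be replaced by a value below the smallest observed value in its vector, so that it ranks lowest. The function name `ici_kt_sketch`, the choice of the tau-b variant, and the decision to keep pairs that are missing in both vectors as ties are assumptions made for illustration only.

```python
import math
from itertools import combinations


def ici_kt_sketch(x, y):
    """Kendall tau-b with NaN treated as left-censored (hypothetical helper).

    Assumption: a missing value lies below every observed value, so it is
    replaced by a number smaller than the observed minimum of its vector.
    Only ranks matter for Kendall tau, so the exact fill value is irrelevant.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    def fill_left_censored(values):
        observed = [v for v in values if not math.isnan(v)]
        if not observed:
            return None  # nothing observed, correlation is undefined
        floor = min(observed) - 1.0
        return [floor if math.isnan(v) else v for v in values]

    fx, fy = fill_left_censored(x), fill_left_censored(y)
    if fx is None or fy is None:
        return math.nan

    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(fx)), 2):
        dx, dy = fx[i] - fx[j], fy[i] - fy[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1

    n_pairs = len(fx) * (len(fx) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else math.nan


if __name__ == "__main__":
    nan = float("nan")
    # Two samples in which the lowest-abundance features are undetected.
    sample_a = [nan, nan, 3.1, 4.7, 8.2, 9.9]
    sample_b = [nan, 1.2, nan, 5.0, 7.5, 11.0]
    print(ici_kt_sketch(sample_a, sample_b))
```

In this sketch a value missing in one sample but observed in the other still contributes concordant or discordant pairs, which is how the missingness adds information instead of being discarded.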

Conclusions: The ICI-Kt methods are available as an R package and Python module on GitHub at https://github.com/moseleyBioinformaticsLab/ICIKendallTau and https://github.com/moseleyBioinformaticsLab/icikt, respectively.

