Information-Content-Informed Kendall-tau Correlation Methodology: Interpreting Missing Values as Useful Information.

Robert M Flight, Praneeth S Bhatt, Hunter Nb Moseley
Journal: bioRxiv : the preprint server for biology
Publication date: 2025-07-21
DOI: 10.1101/2022.02.24.481854 (https://doi.org/10.1101/2022.02.24.481854)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12330630/pdf/

Abstract

Background: Almost all currently available correlation measures cannot directly handle missing values. Typically, missing values are either removed and thus ignored completely, or imputed and then used in the calculation of the correlation coefficient. In either case, the resulting correlation value reflects an assumption that the missing data represent no useful information. However, missing values occur in real data sets for a variety of reasons. In omics data sets derived from analytical measurements, a major reason for missing values is that a specific measurable phenomenon falls below the detection limits of the analytical instrumentation (left-censored values). These missing data are not missing at random; rather, they carry useful information by virtue of their "missingness" at one end of the data distribution.

Results: To include the information carried by left-censored missing values, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion can be interpreted as adding information to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that add further layers of interpretation to the methodology. Using both simulated and real data sets from RNA-seq, metabolomics, and lipidomics experiments, we demonstrate that the ICI-Kt methodology allows left-censored missing values to be included as interpretable information, improving both the determination of outlier samples and the construction of feature-feature networks. We provide explicitly parallel implementations in both R and Python that allow fast calculation of all the variables used when applying the ICI-Kt methodology to large numbers of samples.
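The core idea of counting left-censored values inside a Kendall-tau calculation can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the API of the ICIKendallTau or icikt packages. It assumes that a missing value can be treated as lying below every observed value in the same sample, so that all missing values in a sample are tied at the bottom of the ranking, and it uses the standard tau-b tie correction. The function names (fill_left_censored, kendall_tau_b, ici_kt_sketch) and the choice of "minimum minus one" as the stand-in value are hypothetical.

```python
import math
from itertools import combinations


def _is_missing(value):
    """Return True for None or NaN, the two missing markers this sketch accepts."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_left_censored(values):
    """Replace missing entries with a stand-in below every observed value.

    Assumption of this sketch: missing means "below the detection limit",
    so all missing entries share one rank at the bottom of the sample.
    """
    observed = [v for v in values if not _is_missing(v)]
    if not observed:
        raise ValueError("at least one observed value is required")
    floor = min(observed) - 1.0  # only the ordering matters, not the magnitude
    return [floor if _is_missing(v) else v for v in values]


def kendall_tau_b(x, y):
    """Standard Kendall tau-b computed over all pairs (O(n^2), for clarity)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both samples: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def ici_kt_sketch(x, y):
    """Kendall tau-b after treating missing values as left-censored."""
    return kendall_tau_b(fill_left_censored(x), fill_left_censored(y))


if __name__ == "__main__":
    sample_a = [None, None, 1.0, 2.0]
    sample_b = [None, 1.0, 2.0, 3.0]
    print(ici_kt_sketch(sample_a, sample_b))
```

In this sketch a feature that is missing in one sample and low in the other contributes a concordant pair instead of being dropped, which is the sense in which missingness adds information to the correlation. The quadratic pair loop is chosen for readability; it is not meant to reflect the fast, parallel implementations mentioned in the abstract.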

Conclusions: The ICI-Kt methods are available as an R package and Python module on GitHub at https://github.com/moseleyBioinformaticsLab/ICIKendallTau and https://github.com/moseleyBioinformaticsLab/icikt, respectively.
