Analyzing Coarsened and Missing Data by Imputation Methods.

IF 1.8 | Zone 4 (Medicine) | JCR Q3, Mathematical & Computational Biology

Statistics in Medicine Pub Date : 2025-03-15 DOI:10.1002/sim.70032

Lars L J van der Burg, Stefan Böhringer, Jonathan W Bartlett, Tjalling Bosse, Nanda Horeweg, Liesbeth C de Wreede, Hein Putter

Statistics in Medicine 44(6): e70032. Publication type: Journal Article. PMC full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11881681/pdf/


Abstract

In various missing data problems, values are not entirely missing, but are coarsened. For coarsened observations, instead of observing the true value, a subset of values (strictly smaller than the full sample space of the variable) is observed to which the true value belongs. In our motivating example for patients with endometrial carcinoma, the degree of lymphovascular space invasion (LVSI) can be either absent, focally present, or substantially present. For a subset of individuals, however, LVSI is reported as being present, which includes both non-absent options. In the analysis of such a dataset, difficulties arise when coarsened observations are to be used in an imputation procedure. To our knowledge, no clear-cut method has been described in the literature on how to handle an observed subset of values, and treating them as entirely missing could lead to biased estimates. Therefore, in this paper, we evaluated the best strategy to deal with coarsened and missing data in multiple imputation. We tested a number of plausible ad hoc approaches, possibly already in use by statisticians. Additionally, we propose a principled approach to this problem, consisting of an adaptation of the SMC-FCS algorithm (SMC-FCS$_{\mathrm{CoCo}}$: Coarsening compatible), that ensures that imputed values adhere to the coarsening information. These methods were compared in a simulation study. This comparison shows that methods that prevent imputations of incompatible values, like the SMC-FCS$_{\mathrm{CoCo}}$ method, perform consistently better in terms of a lower bias and RMSE, and achieve better coverage than methods that ignore coarsening or handle it in a more naïve way. The analysis of the motivating example shows that the way the coarsening information is handled can matter substantially, leading to different conclusions across methods. Overall, our proposed SMC-FCS$_{\mathrm{CoCo}}$ method outperforms other methods in handling coarsened data, requires limited additional computation cost and is easily extendable to other scenarios.
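The central requirement in the abstract, that an imputed value must lie in the observed subset, can be illustrated with a minimal Python sketch. This is not the authors' SMC-FCS$_{\mathrm{CoCo}}$ algorithm; it only shows the restriction step for a single categorical variable, assuming some imputation model has already produced predictive probabilities for each LVSI category. The function names and the probabilities are hypothetical.

```python
import random

# LVSI categories as described in the abstract.
LEVELS = ("absent", "focal", "substantial")


def restrict_to_compatible(probs, allowed):
    """Drop categories outside `allowed` and renormalize the rest."""
    kept = {lvl: p for lvl, p in probs.items() if lvl in allowed}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no probability mass on the compatible categories")
    return {lvl: p / total for lvl, p in kept.items()}


def impute_one(probs, allowed, rng):
    """Draw one imputed category from the restricted distribution."""
    restricted = restrict_to_compatible(probs, allowed)
    levels = list(restricted)
    weights = [restricted[lvl] for lvl in levels]
    return rng.choices(levels, weights=weights, k=1)[0]


# Hypothetical predictive probabilities for one patient (made-up numbers,
# standing in for the output of whatever imputation model is used).
probs = {"absent": 0.5, "focal": 0.25, "substantial": 0.25}

# LVSI recorded only as "present": the compatible set excludes "absent".
present = {"focal", "substantial"}
restricted = restrict_to_compatible(probs, present)

rng = random.Random(0)  # seeded for reproducibility
draws = [impute_one(probs, present, rng) for _ in range(5)]
```

In this sketch a fully missing value corresponds to `allowed = set(LEVELS)`, which leaves the probabilities unchanged, and a fully observed value corresponds to a single-element set. Treating the coarsened value as entirely missing would instead allow "absent" to be drawn, which contradicts the recorded information.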


Source journal

Statistics in Medicine (Medicine: Public, Environmental & Occupational Health)

CiteScore: 3.40

Self-citation rate: 10.00%

Articles published: 334

Review time: 2-4 weeks

About the journal: The journal aims to influence practice in medicine and its associated sciences through the publication of papers on statistical and other quantitative methods. Papers will explain new methods and demonstrate their application, preferably through a substantive, real, motivating example or a comprehensive evaluation based on an illustrative example. Alternatively, papers will report on case studies where creative use or technical generalization of established methodology is directed towards a substantive application. Reviews of, and tutorials on, general topics relevant to the application of statistics to medicine will also be published. The main criteria for publication are appropriateness of the statistical methods to a particular medical problem and clarity of exposition. Papers with primarily mathematical content will be excluded. The journal aims to enhance communication between statisticians, clinicians and medical researchers.