Perturbation-based Analysis of Compositional Data
Anton Rask Lundborg, Niklas Pfister
arXiv:2311.18501 (arXiv - MATH - Statistics Theory), published 2023-11-30
Abstract
Existing statistical methods for compositional data analysis are inadequate
for many modern applications for two reasons. First, modern compositional
datasets, for example in microbiome research, display traits such as
high dimensionality and sparsity that are poorly modelled by traditional
approaches. Second, assessing -- in an unbiased way -- how summary statistics
of a composition (e.g., racial diversity) affect a response variable is not
straightforward. In this work, we propose a framework based on hypothetical
data perturbations that addresses both issues. Unlike existing methods for
compositional data, we do not transform the data and instead use perturbations
to define interpretable statistical functionals on the compositions themselves,
which we call average perturbation effects. These average perturbation effects,
which can be employed in many applications, naturally account for confounding
that biases frequently used marginal dependence analyses. We show how average
perturbation effects can be estimated efficiently by deriving a
perturbation-dependent reparametrization and applying semiparametric estimation
techniques. We analyze the proposed estimators empirically on simulated data
and demonstrate advantages over existing techniques on US census and microbiome
data. For all proposed estimators, we provide confidence intervals with uniform
asymptotic coverage guarantees.