{"title":"Extended-support beta regression for $[0, 1]$ responses","authors":"Ioannis Kosmidis, Achim Zeileis","doi":"arxiv-2409.07233","DOIUrl":"https://doi.org/arxiv-2409.07233","url":null,"abstract":"We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of $(0, 1)$. Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both $0$ and $1$ -- known as the heteroscedastic two-limit tobit model in the econometrics literature -- are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"108 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local Effects of Continuous Instruments without Positivity","authors":"Prabrisha Rakshit, Alexander Levis, Luke Keele","doi":"arxiv-2409.07350","DOIUrl":"https://doi.org/arxiv-2409.07350","url":null,"abstract":"Instrumental variables have become a popular study design for the estimation of treatment effects in the presence of unobserved confounders. In the canonical instrumental variables design, the instrument is a binary variable, and most extant methods are tailored to this context. In many settings, however, the instrument is a continuous measure. Standard estimation methods can be applied with continuous instruments, but they require strong assumptions regarding functional form. Moreover, while some recent work has introduced more flexible approaches for continuous instruments, these methods require an assumption known as positivity that is unlikely to hold in many applications. We derive a novel family of causal estimands using a stochastic dynamic intervention framework that considers a range of intervention distributions that are absolutely continuous with respect to the observed distribution of the instrument. These estimands focus on a specific form of local effect but do not require a positivity assumption. Next, we develop doubly robust estimators for these estimands that allow for estimation of the nuisance functions via nonparametric estimators. We use empirical process theory and sample splitting to derive asymptotic properties of the proposed estimators under weak conditions. In addition, we derive methods for profiling the principal strata as well as a method for sensitivity analysis for assessing robustness to an underlying monotonicity assumption. We evaluate our methods via simulation and demonstrate their feasibility using an application on the effectiveness of surgery for specific emergency conditions.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-source Stable Variable Importance Measure via Adversarial Machine Learning","authors":"Zitao Wang, Nian Si, Zijian Guo, Molei Liu","doi":"arxiv-2409.07380","DOIUrl":"https://doi.org/arxiv-2409.07380","url":null,"abstract":"As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial in reliable and generalizable decision-making. In this paper, we propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single source variable importance. Numerical studies with various types of data generation setups and machine learning implementation are conducted to justify the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning","authors":"Matteo D'Alessandro, Theophilus Quachie Asenso, Manuela Zucknick","doi":"arxiv-2409.07125","DOIUrl":"https://doi.org/arxiv-2409.07125","url":null,"abstract":"Modeling with multi-omics data presents multiple challenges such as the high-dimensionality of the problem ($p \\gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data from the integration of two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"195 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential stratified inference for the mean","authors":"Jacob V. Spertus, Mayuri Sridhar, Philip B. Stark","doi":"arxiv-2409.06680","DOIUrl":"https://doi.org/arxiv-2409.06680","url":null,"abstract":"We develop conservative tests for the mean of a bounded population using data from a stratified sample. The sample may be drawn sequentially, with or without replacement. The tests are \"anytime valid,\" allowing optional stopping and continuation in each stratum. We call this combination of properties sequential, finite-sample, nonparametric validity. The methods express a hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. They test each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. The $P$-value of the global null hypothesis is then the maximum $P$-value of any intersection hypothesis in the union. This approach has three primary moving parts: (i) the rule for deciding which stratum to draw from next to test each intersection null, given the sample so far; (ii) the form of the TSM for each null in each stratum; and (iii) the method of combining evidence across strata. These choices interact. We examine the performance of a variety of rules with differing computational complexity. Approximately optimal methods have a prohibitive computational cost, while naive rules may be inconsistent -- they will never reject for some alternative populations, no matter how large the sample. We present a method that is statistically comparable to optimal methods in examples where optimal methods are computable, but computationally tractable for arbitrarily many strata. In numerical examples its expected sample size is substantially smaller than that of previous methods.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonparametric Inference for Balance in Signed Networks","authors":"Xuyang Chen, Yinjie Wang, Weijing Tang","doi":"arxiv-2409.06172","DOIUrl":"https://doi.org/arxiv-2409.06172","url":null,"abstract":"In many real-world networks, relationships often go beyond simple dyadic presence or absence; they can be positive, like friendship, alliance, and mutualism, or negative, characterized by enmity, disputes, and competition. To understand the formation mechanism of such signed networks, the social balance theory sheds light on the dynamics of positive and negative connections. In particular, it characterizes the proverbs, \"a friend of my friend is my friend\" and \"an enemy of my enemy is my friend\". In this work, we propose a nonparametric inference approach for assessing empirical evidence for the balance theory in real-world signed networks. We first characterize the generating process of signed networks with node exchangeability and propose a nonparametric sparse signed graphon model. Under this model, we construct confidence intervals for the population parameters associated with balance theory and establish their theoretical validity. Our inference procedure is as computationally efficient as a simple normal approximation but offers higher-order accuracy. By applying our method, we find strong real-world evidence for balance theory in signed networks across various domains, extending its applicability beyond social psychology.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ensemble Doubly Robust Bayesian Inference via Regression Synthesis","authors":"Kaoru Babasaki, Shonosuke Sugasawa, Kosaku Takanashi, Kenichiro McAlinn","doi":"arxiv-2409.06288","DOIUrl":"https://doi.org/arxiv-2409.06288","url":null,"abstract":"The doubly robust estimator, which models both the propensity score and outcomes, is a popular approach to estimate the average treatment effect in the potential outcome setting. The primary appeal of this estimator is its theoretical property, wherein the estimator achieves consistency as long as either the propensity score or outcomes is correctly specified. In most applications, however, both are misspecified, leading to considerable bias that cannot be checked. In this paper, we propose a Bayesian ensemble approach that synthesizes multiple models for both the propensity score and outcomes, which we call doubly robust Bayesian regression synthesis. Our approach applies Bayesian updating to the ensemble model weights that adapt at the unit level, incorporating data heterogeneity, to significantly mitigate misspecification bias. Theoretically, we show that our proposed approach is consistent regarding the estimation of both the propensity score and outcomes, ensuring that the doubly robust estimator is consistent, even if no single model is correctly specified. An efficient algorithm for posterior computation facilitates the characterization of uncertainty regarding the treatment effect. Our proposed approach is compared against standard and state-of-the-art methods through two comprehensive simulation studies, where we find that our approach is superior in all cases. An empirical study on the impact of maternal smoking on birth weight highlights the practical applicability of our proposed method.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach","authors":"Yunhui Qi, Xinyi Wang, Li-Xuan Qin","doi":"arxiv-2409.06180","DOIUrl":"https://doi.org/arxiv-2409.06180","url":null,"abstract":"Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new paradigm for global sensitivity analysis","authors":"Gildas Mazo","doi":"arxiv-2409.06271","DOIUrl":"https://doi.org/arxiv-2409.06271","url":null,"abstract":"Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope (for instance, the analysis is limited to the output's variance and the inputs have to be mutually independent) and leads to sensitivity indices whose interpretation is not fully clear, especially for interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Causal Analysis of Shapley Values: Conditional vs. Marginal","authors":"Ilya Rozenfeld","doi":"arxiv-2409.06157","DOIUrl":"https://doi.org/arxiv-2409.06157","url":null,"abstract":"Shapley values, a game-theoretic concept, have been one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches to calculating Shapley values, conditional and marginal, can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to a situation in the literature where different authors give contradictory recommendations regarding the choice of approach. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"192 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}