Efficient semiparametric estimation in two-sample comparison via semisupervised learning

IF 0.8, Zone 4 (Mathematics), JCR Q3 (Statistics & Probability)

Canadian Journal of Statistics-Revue Canadienne De Statistique, Pub Date: 2024-09-03, DOI: 10.1002/cjs.11813

Tao Tan, Shuyi Zhang, Yong Zhou



Abstract

We develop a general semisupervised framework for statistical inference in the two-sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two-sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two-step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow-up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two-step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.
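As a point of reference for the supervised baseline mentioned in the abstract, the following is a minimal sketch of the Mann–Whitney estimate of the probabilistic index P(X < Y) + 0.5 P(X = Y), computed from labelled data only. The variance step is a generic multiplier (perturbation) resampling scheme with Exp(1) weights, which is an assumption made here for illustration; it is not the paper's two-step procedure, and the semisupervised imputation and reweighting steps are not reproduced. The function names `mann_whitney_index` and `perturbation_se` are hypothetical.

```python
import numpy as np


def _pairwise_kernel(x, y):
    """Kernel matrix h(x_i, y_j) = 1{x_i < y_j} + 0.5 * 1{x_i == y_j}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diff = y[None, :] - x[:, None]  # shape (len(x), len(y))
    return (diff > 0) + 0.5 * (diff == 0)


def mann_whitney_index(x, y):
    """Supervised Mann-Whitney estimate of P(X < Y) + 0.5 * P(X = Y)."""
    return float(_pairwise_kernel(x, y).mean())


def perturbation_se(x, y, n_draws=500, seed=0):
    """Standard error from a generic multiplier resampling scheme.

    Assumption for illustration: each observation gets an independent
    Exp(1) weight, and the weighted statistic is recomputed n_draws times.
    """
    rng = np.random.default_rng(seed)
    kernel = _pairwise_kernel(x, y)
    n_x, n_y = kernel.shape
    draws = np.empty(n_draws)
    for b in range(n_draws):
        gx = rng.exponential(1.0, size=n_x)  # weights for the first sample
        gy = rng.exponential(1.0, size=n_y)  # weights for the second sample
        draws[b] = gx @ kernel @ gy / (gx.sum() * gy.sum())
    return float(draws.std(ddof=1))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_t(df=3, size=80)        # heavy-tailed responses, group 1
    y = rng.standard_t(df=3, size=90) + 0.5  # shifted responses, group 2
    print("index estimate:", mann_whitney_index(x, y))
    print("perturbation SE:", perturbation_se(x, y))
```

Because this estimator uses only the labelled responses, any unlabelled covariate information is ignored, which is the inefficiency the abstract sets out to address.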




Source journal

Canadian Journal of Statistics-Revue Canadienne De Statistique (Mathematics: Statistics & Probability)

CiteScore: 1.40

Self-citation rate: 0.00%

Review time: >12 weeks

About the journal: The Canadian Journal of Statistics is the official journal of the Statistical Society of Canada. It has a reputation internationally as an excellent journal. The editorial board comprises statistical scientists with applied, computational, methodological, theoretical and probabilistic interests. Their role is to ensure that the journal continues to provide an international forum for the discipline of Statistics. The journal seeks papers making broad points of interest to many readers, whereas papers making important points of more specific interest are better placed in more specialized journals. The levels of innovation and impact are key in the evaluation of submitted manuscripts.