Efficient semiparametric estimation in two‐sample comparison via semisupervised learning

Tao Tan, Shuyi Zhang, Yong Zhou
The Canadian Journal of Statistics, published 2024-09-03. DOI: 10.1002/cjs.11813 (https://doi.org/10.1002/cjs.11813)

Abstract

We develop a general semisupervised framework for statistical inference in the two‐sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two‐sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two‐step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow‐up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two‐step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.
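As a point of reference for the supervised baseline mentioned in the abstract, the following is a minimal sketch of the Mann–Whitney estimate of P(X < Y) + 0.5 P(X = Y) from two labelled samples, with a generic perturbation-resampling standard error. It is not the authors' two-step semisupervised procedure: the function names, the Exp(1) perturbation weights, and the number of replicates are assumptions chosen for illustration only.

```python
import numpy as np


def mann_whitney_index(x, y, wx=None, wy=None):
    """Weighted Mann-Whitney estimate of P(X < Y) + 0.5 * P(X = Y).

    With unit weights this is the usual supervised two-sample statistic.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    wx = np.ones(len(x)) if wx is None else np.asarray(wx, dtype=float)
    wy = np.ones(len(y)) if wy is None else np.asarray(wy, dtype=float)
    # Pairwise kernel: 1 if x_i < y_j, 0.5 if tied, 0 otherwise.
    kernel = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    weights = wx[:, None] * wy[None, :]
    return float((weights * kernel).sum() / weights.sum())


def perturbation_se(x, y, n_rep=500, seed=0):
    """Standard error via perturbation resampling (illustrative assumption).

    Each replicate reweights every observation with an independent Exp(1)
    weight (mean 1, variance 1) and recomputes the statistic; the spread of
    the replicates approximates the sampling variability.
    """
    rng = np.random.default_rng(seed)
    reps = [
        mann_whitney_index(
            x, y,
            rng.exponential(1.0, len(x)),
            rng.exponential(1.0, len(y)),
        )
        for _ in range(n_rep)
    ]
    return float(np.std(reps, ddof=1))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_t(df=3, size=80)        # heavy-tailed group 1
    y = rng.standard_t(df=3, size=90) + 0.5  # shifted group 2
    print(mann_whitney_index(x, y), perturbation_se(x, y))
```

This baseline uses labelled responses only; the framework described in the abstract additionally imputes from unlabelled covariates through a probabilistic index model and reweights, which this sketch does not attempt.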