{"title":"通过半监督学习进行双样本比较中的高效半参数估计","authors":"Tao Tan, Shuyi Zhang, Yong Zhou","doi":"10.1002/cjs.11813","DOIUrl":null,"url":null,"abstract":"<p>We develop a general semisupervised framework for statistical inference in the two-sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two-sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two-step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow-up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two-step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.</p>","PeriodicalId":55281,"journal":{"name":"Canadian Journal of Statistics-Revue Canadienne De Statistique","volume":"53 2","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient semiparametric estimation in two-sample comparison via semisupervised learning\",\"authors\":\"Tao Tan, Shuyi Zhang, Yong Zhou\",\"doi\":\"10.1002/cjs.11813\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>We develop a general semisupervised framework for statistical inference in the two-sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two-sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two-step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow-up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two-step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.</p>\",\"PeriodicalId\":55281,\"journal\":{\"name\":\"Canadian Journal of Statistics-Revue Canadienne De Statistique\",\"volume\":\"53 2\",\"pages\":\"\"},\"PeriodicalIF\":0.8000,\"publicationDate\":\"2024-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Canadian Journal of Statistics-Revue Canadienne De Statistique\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cjs.11813\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Canadian Journal of Statistics-Revue Canadienne De Statistique","FirstCategoryId":"100","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cjs.11813","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
Efficient semiparametric estimation in two-sample comparison via semisupervised learning
We develop a general semisupervised framework for statistical inference in the two-sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two-sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two-step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow-up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two-step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.
期刊介绍:
The Canadian Journal of Statistics is the official journal of the Statistical Society of Canada. It has a reputation internationally as an excellent journal. The editorial board is comprised of statistical scientists with applied, computational, methodological, theoretical and probabilistic interests. Their role is to ensure that the journal continues to provide an international forum for the discipline of Statistics.
The journal seeks papers making broad points of interest to many readers, whereas papers making important points of more specific interest are better placed in more specialized journals. The levels of innovation and impact are key in the evaluation of submitted manuscripts.