Surprise sampling: Improving and extending the local case-control sampling

Xinwei Shen, Kani Chen, Wen Yu
{"title":"惊喜抽样:改进和推广局部病例对照抽样","authors":"Xinwei Shen, Kani Chen, Wen Yu","doi":"10.1214/21-EJS1844","DOIUrl":null,"url":null,"abstract":"Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear \"surprising\" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai, et al. (2018), the proposed one has several advantages. It adaptively gives out the optimal forms to a variety of objectives, including the LCC and Ai et al. (2018)'s sampling as special cases. Under same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on full data. We present theoretical justifications of the claimed advantages and optimality of the estimation and the sampling design. Different from Ai, et al. (2018), our large sample theory are population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially only requires a specific loss function and no response-covariate structure of data is needed. Numerical studies are carried out and the evidence in support of the theory is shown.","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Surprise sampling: Improving and extending the local case-control sampling\",\"authors\":\"Xinwei Shen, Kani Chen, Wen Yu\",\"doi\":\"10.1214/21-EJS1844\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear \\\"surprising\\\" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai, et al. (2018), the proposed one has several advantages. It adaptively gives out the optimal forms to a variety of objectives, including the LCC and Ai et al. (2018)'s sampling as special cases. Under same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on full data. We present theoretical justifications of the claimed advantages and optimality of the estimation and the sampling design. Different from Ai, et al. 
(2018), our large sample theory are population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially only requires a specific loss function and no response-covariate structure of data is needed. Numerical studies are carried out and the evidence in support of the theory is shown.\",\"PeriodicalId\":186390,\"journal\":{\"name\":\"arXiv: Methodology\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv: Methodology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1214/21-EJS1844\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv: Methodology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1214/21-EJS1844","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Fithian and Hastie (2014) proposed a sampling scheme called local case-control (LCC) sampling, which achieves stability and efficiency through a clever adjustment tailored to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large pilot prediction error or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai et al. (2018), the proposed scheme has several advantages. It adaptively yields the optimal sampling form for a variety of objectives, and it includes the LCC sampling and the sampling of Ai et al. (2018) as special cases. Under the same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure remains valid even if the model is misspecified and/or the pilot estimator is inconsistent or depends on the full data. We present theoretical justifications of the claimed advantages and of the optimality of the estimation and the sampling design. Unlike Ai et al. (2018), our large-sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning, since it essentially requires only a specific loss function and no response-covariate structure of the data. Numerical studies are carried out and provide evidence in support of the theory.
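
To make the working principle concrete, the sketch below uses the LCC acceptance rule, in which the "surprise" of a point is the absolute pilot prediction error |y - p~(x)|, followed by a generic inverse-probability-weighted refit on the kept points. It only illustrates the general subsampling idea on simulated data; it is not the paper's estimator, and the paper's optimal sampling probabilities and bias correction are not reproduced here. All names in the snippet (pilot_idx, surprise, keep, and so on) are ours, not the authors'.

# Illustrative sketch only: LCC-style "surprise" subsampling with a generic
# inverse-probability-weighted refit. Not the paper's exact estimator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulate a large, imbalanced binary classification data set.
n, p = 200_000, 5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-(X @ beta - 3.0)))   # intercept -3 makes positives rare
y = rng.binomial(1, prob)

# Step 1: pilot estimate from a small uniform subsample.
pilot_idx = rng.choice(n, size=2_000, replace=False)
pilot = LogisticRegression().fit(X[pilot_idx], y[pilot_idx])

# Step 2: "surprise" score = absolute pilot prediction error; points the pilot
# gets badly wrong are kept with high probability (the LCC acceptance rule).
p_tilde = pilot.predict_proba(X)[:, 1]
surprise = np.abs(y - p_tilde)
keep = rng.uniform(size=n) < surprise

# Step 3: refit on the subsample, weighting each kept point by the inverse of
# its acceptance probability so the weighted objective approximates the
# full-data objective even when the model is misspecified.
w = 1.0 / np.clip(surprise[keep], 1e-6, None)
final = LogisticRegression().fit(X[keep], y[keep], sample_weight=w)

print("subsample size:", keep.sum(), "of", n)
print("pilot coef:", np.round(pilot.coef_.ravel(), 2))
print("final coef:", np.round(final.coef_.ravel(), 2))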