先验概率偏移下的高效数据整合。

IF 1.7 4区数学 Q3 BIOLOGY

Biometrics Pub Date : 2024-03-27 DOI:10.1093/biomtc/ujae035

Ming-Yueh Huang, Jing Qin, Chiung-Yu Huang

{"title":"先验概率偏移下的高效数据整合。","authors":"Ming-Yueh Huang, Jing Qin, Chiung-Yu Huang","doi":"10.1093/biomtc/ujae035","DOIUrl":null,"url":null,"abstract":"Conventional supervised learning usually operates under the premise that data are collected from the same underlying population. However, challenges may arise when integrating new data from different populations, resulting in a phenomenon known as dataset shift. This paper focuses on prior probability shift, where the distribution of the outcome varies across datasets but the conditional distribution of features given the outcome remains the same. To tackle the challenges posed by such shift, we propose an estimation algorithm that can efficiently combine information from multiple sources. Unlike existing methods that are restricted to discrete outcomes, the proposed approach accommodates both discrete and continuous outcomes. It also handles high-dimensional covariate vectors through variable selection using an adaptive least absolute shrinkage and selection operator penalty, producing efficient estimates that possess the oracle property. Moreover, a novel semiparametric likelihood ratio test is proposed to check the validity of prior probability shift assumptions by embedding the null conditional density function into Neyman's smooth alternatives (Neyman, 1937) and testing study-specific parameters. We demonstrate the effectiveness of our proposed method through extensive simulations and a real data example. The proposed methods serve as a useful addition to the repertoire of tools for dealing dataset shifts.","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"80 2","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient data integration under prior probability shift.\",\"authors\":\"Ming-Yueh Huang, Jing Qin, Chiung-Yu Huang\",\"doi\":\"10.1093/biomtc/ujae035\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Conventional supervised learning usually operates under the premise that data are collected from the same underlying population. However, challenges may arise when integrating new data from different populations, resulting in a phenomenon known as dataset shift. This paper focuses on prior probability shift, where the distribution of the outcome varies across datasets but the conditional distribution of features given the outcome remains the same. To tackle the challenges posed by such shift, we propose an estimation algorithm that can efficiently combine information from multiple sources. Unlike existing methods that are restricted to discrete outcomes, the proposed approach accommodates both discrete and continuous outcomes. It also handles high-dimensional covariate vectors through variable selection using an adaptive least absolute shrinkage and selection operator penalty, producing efficient estimates that possess the oracle property. Moreover, a novel semiparametric likelihood ratio test is proposed to check the validity of prior probability shift assumptions by embedding the null conditional density function into Neyman's smooth alternatives (Neyman, 1937) and testing study-specific parameters. We demonstrate the effectiveness of our proposed method through extensive simulations and a real data example. The proposed methods serve as a useful addition to the repertoire of tools for dealing dataset shifts.\",\"PeriodicalId\":8930,\"journal\":{\"name\":\"Biometrics\",\"volume\":\"80 2\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-03-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biometrics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1093/biomtc/ujae035\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biometrics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1093/biomtc/ujae035","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

传统的监督学习通常以从同一基础人群中收集数据为前提。然而，在整合来自不同群体的新数据时，可能会出现挑战，导致一种被称为数据集偏移的现象。本文重点讨论先验概率偏移，即不同数据集的结果分布不同，但给定结果的特征的条件分布保持不变。为了应对这种偏移带来的挑战，我们提出了一种能有效结合多种来源信息的估计算法。与局限于离散结果的现有方法不同，我们提出的方法同时适用于离散和连续结果。它还通过使用自适应最小绝对收缩和选择算子惩罚的变量选择来处理高维协变量向量，从而产生具有甲骨文特性的高效估计。此外，我们还提出了一种新的半参数似然比检验方法，通过将空条件密度函数嵌入奈曼平滑替代变量（奈曼，1937 年）并检验特定研究参数，来检查先验概率偏移假设的有效性。我们通过大量模拟和真实数据示例证明了所提方法的有效性。所提出的方法是对处理数据集偏移工具的有益补充。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient data integration under prior probability shift.

Conventional supervised learning usually operates under the premise that data are collected from the same underlying population. However, challenges may arise when integrating new data from different populations, resulting in a phenomenon known as dataset shift. This paper focuses on prior probability shift, where the distribution of the outcome varies across datasets but the conditional distribution of features given the outcome remains the same. To tackle the challenges posed by such shift, we propose an estimation algorithm that can efficiently combine information from multiple sources. Unlike existing methods that are restricted to discrete outcomes, the proposed approach accommodates both discrete and continuous outcomes. It also handles high-dimensional covariate vectors through variable selection using an adaptive least absolute shrinkage and selection operator penalty, producing efficient estimates that possess the oracle property. Moreover, a novel semiparametric likelihood ratio test is proposed to check the validity of prior probability shift assumptions by embedding the null conditional density function into Neyman's smooth alternatives (Neyman, 1937) and testing study-specific parameters. We demonstrate the effectiveness of our proposed method through extensive simulations and a real data example. The proposed methods serve as a useful addition to the repertoire of tools for dealing dataset shifts.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Biometrics 生物-生物学

CiteScore

2.70

自引率

5.30%

发文量

178

审稿时长

4-8 weeks

期刊介绍： The International Biometric Society is an international society promoting the development and application of statistical and mathematical theory and methods in the biosciences, including agriculture, biomedical science and public health, ecology, environmental sciences, forestry, and allied disciplines. The Society welcomes as members statisticians, mathematicians, biological scientists, and others devoted to interdisciplinary efforts in advancing the collection and interpretation of information in the biosciences. The Society sponsors the biennial International Biometric Conference, held in sites throughout the world; through its National Groups and Regions, it also Society sponsors regional and local meetings.