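A New Paradigm for High-dimensional Data: Distance-Based Semiparametric Feature Aggregation Framework via Between-Subject Attributes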

IF 1, Zone 4 (Mathematics), Q3 STATISTICS & PROBABILITY

Scandinavian Journal of Statistics, Pub Date: 2023-11-08, DOI: 10.1111/sjos.12695

Jinyuan Liu, Xinlian Zhang, Tuo Lin, Ruohui Chen, Yuan Zhong, Tian Chen, Tsungchin Wu, Chenyu Liu, Anna Huang, Tanya T. Nguyen, Ellen E. Lee, Dilip V. Jeste, Xin M. Tu

Citations: 0

Abstract

A New Paradigm for High‐dimensional Data: Distance‐Based Semiparametric Feature Aggregation Framework via Between‐Subject Attributes

This article proposes a distance‐based framework incentivized by the paradigm shift towards feature aggregation for high‐dimensional data, which does not rely on the sparse‐feature assumption or the permutation‐based inference. Focusing on distance‐based outcomes that preserve information without truncating any features, a class of semiparametric regression has been developed, which encapsulates multiple sources of high‐dimensional variables using pairwise outcomes of between‐subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via the U‐statistics‐based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root‐n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to the human microbiome and wearables data are provided, where the feature dimensions are tens of thousands. This article is protected by copyright. All rights reserved.
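To make the idea of "pairwise outcomes of between-subject attributes" concrete, the following is a minimal sketch and not the authors' implementation. It assumes Euclidean distance between two subjects' feature vectors as the pairwise outcome and the absolute difference of a scalar covariate as the pairwise predictor; the abstract does not specify either choice. The function names `pairwise_outcomes` and `fit_pairwise_linear` and the simulated data are hypothetical. The sketch returns least-squares point estimates only: pairs that share a subject are correlated, so ordinary regression standard errors would not be valid here, and the UGEE inference described in the abstract is not implemented.

```python
import numpy as np


def pairwise_outcomes(features, covariate):
    """Build between-subject quantities for every unordered pair i < j.

    Returns (d, f), where d[k] is the Euclidean distance between the two
    subjects' feature vectors (an assumed choice of distance) and f[k] is
    the absolute difference of their scalar covariate.
    """
    n = features.shape[0]
    i, j = np.triu_indices(n, k=1)  # index arrays for all pairs with i < j
    d = np.linalg.norm(features[i] - features[j], axis=1)
    f = np.abs(covariate[i] - covariate[j])
    return d, f


def fit_pairwise_linear(d, f):
    """Least-squares fit of d ~ b0 + b1 * f over pairs (point estimates only)."""
    design = np.column_stack([np.ones_like(f), f])
    coef, *_ = np.linalg.lstsq(design, d, rcond=None)
    return coef


# Simulated example: 30 subjects, 5000 features, none of which is dropped.
rng = np.random.default_rng(0)
n, p = 30, 5000
x = rng.normal(size=n)
features = rng.normal(size=(n, p)) + 0.5 * x[:, None]  # covariate shifts all features

d, f = pairwise_outcomes(features, x)
b0, b1 = fit_pairwise_linear(d, f)
print(f"pairs: {d.size}, intercept: {b0:.2f}, slope: {b1:.2f}")
```

All 5000 features enter through the distance, so no feature selection step is needed, and the regression itself has only two parameters regardless of the feature dimension.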

Source journal

Scandinavian Journal of Statistics (Mathematics: Statistics & Probability)

CiteScore

1.80

Self-citation rate

0.00%

Articles published

Review time

6-12 weeks

About the journal: The Scandinavian Journal of Statistics is internationally recognised as one of the leading statistical journals in the world. It was founded in 1974 by four Scandinavian statistical societies. Today more than eighty per cent of the manuscripts are submitted from outside Scandinavia. It is an international journal devoted to reporting significant and innovative original contributions to statistical methodology, both theory and applications. The journal specializes in statistical modelling showing particular appreciation of the underlying substantive research problems. The emergence of specialized methods for analysing longitudinal and spatial data is just one example of an area of important methodological development in which the Scandinavian Journal of Statistics has a particular niche.