Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores

Impact factor 1.8; CAS journal ranking: Zone 3 (Biology); JCR Q4 (Mathematical & Computational Biology)

Biometrical Journal Pub Date : 2024-11-29 DOI:10.1002/bimj.70014

Sven Teschke, Katja Ickstadt, Alexander Munteanu



Abstract

We develop a variable selection method for interactions in regression models on large data in the context of genetics. The method is intended for investigating the influence of single-nucleotide polymorphisms (SNPs) and their interactions on health outcomes, which is a $p \gg n$ problem. We introduce cross leverage scores (CLSs) to detect interactions of variables while maintaining interpretability. With this method, it is not necessary to consider every possible interaction between variables individually, which would be very time-consuming even for a moderate number of variables. Instead, we calculate the CLS for each variable and obtain a measure of importance for this variable. Calculating the scores remains time-consuming for large data sets. The key idea for scaling to large data is to divide the data into smaller random batches or consecutive windows of variables. This avoids costly computations on high-dimensional matrices by performing them only on small subsets of the data. We compare these methods to provable approximations of CLS based on sketching, which aims at summarizing data succinctly. In a simulation study, we show that the CLSs are directly linked to the importance of a variable in the sense of an interaction effect. We further show that the approximation approaches are appropriate for performing the calculations efficiently on arbitrarily large data while preserving the interaction detection effect of the CLS. This underlines their scalability to genome-wide data. In addition, we evaluate the methods on real data from the HapMap project.
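The abstract does not spell out how the scores are defined or aggregated, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the standard linear-algebra definition: for a data matrix X with n observations and p variables, the cross leverage score between variables i and j is entry (i, j) of the orthogonal projection onto the column space of X^T. The function names, the per-variable summary (largest absolute off-diagonal entry), and the batching scheme are hypothetical choices made for illustration.

```python
import numpy as np


def cross_leverage_matrix(X):
    """Return the p x p projection matrix whose off-diagonal entries are
    cross leverage scores between the p columns (variables) of X."""
    # Thin SVD of X^T (p x n): the columns of U span the column space of X^T.
    U, s, _ = np.linalg.svd(X.T, full_matrices=False)
    # Discard directions with numerically zero singular values.
    tol = (s.max() if s.size else 0.0) * max(X.shape) * np.finfo(float).eps
    U = U[:, s > tol]
    return U @ U.T


def batched_scores(X, batch_size, rng=None):
    """Per-variable score computed inside batches of variables only.

    rng=None gives consecutive windows; passing a NumPy Generator gives
    random batches. The summary used here (largest absolute off-diagonal
    entry) is an assumption for illustration, not taken from the paper.
    """
    p = X.shape[1]
    order = np.arange(p) if rng is None else rng.permutation(p)
    scores = np.zeros(p)
    for start in range(0, p, batch_size):
        idx = order[start:start + batch_size]
        H = cross_leverage_matrix(X[:, idx])
        np.fill_diagonal(H, 0.0)  # keep only the cross (off-diagonal) terms
        scores[idx] = np.abs(H).max(axis=1)
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 200))                    # n = 20, p = 200
    windows = batched_scores(X, batch_size=50)        # consecutive windows
    batches = batched_scores(X, batch_size=50, rng=rng)  # random batches
    print(windows[:5], batches[:5])
```

Under this definition, a batch whose variables are linearly independent and number at most n projects onto the identity, so all cross terms vanish; in this sketch the batch size therefore needs to exceed n for the scores to carry information.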



Source journal

Biometrical Journal (Biology: Mathematical & Computational Biology)

CiteScore: 3.20
Self-citation rate: 5.90%
Articles published: 119
Review time: 6-12 weeks

About the journal: Biometrical Journal publishes papers on statistical methods and their applications in life sciences including medicine, environmental sciences and agriculture. Methodological developments should be motivated by an interesting and relevant problem from these areas. Ideally the manuscript should include a description of the problem and a section detailing the application of the new methodology to the problem. Case studies, review articles and letters to the editors are also welcome. Papers containing only extensive mathematical theory are not suitable for publication in Biometrical Journal.