Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores

IF 1.3 3区 生物学 Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Sven Teschke, Katja Ickstadt, Alexander Munteanu
{"title":"Detecting Interactions in High-Dimensional Data Using Cross Leverage Scores","authors":"Sven Teschke,&nbsp;Katja Ickstadt,&nbsp;Alexander Munteanu","doi":"10.1002/bimj.70014","DOIUrl":null,"url":null,"abstract":"<p>We develop a variable selection method for interactions in regression models on large data in the context of genetics. The method is intended for investigating the influence of single-nucleotide polymorphisms (SNPs) and their interactions on health outcomes, which is a <span></span><math>\n <semantics>\n <mrow>\n <mi>p</mi>\n <mo>≫</mo>\n <mi>n</mi>\n </mrow>\n <annotation>$p\\gg n$</annotation>\n </semantics></math> problem. We introduce cross leverage scores (CLSs) to detect interactions of variables while maintaining interpretability. Using this method, it is not necessary to consider every possible interaction between variables individually, which would be very time-consuming even for moderate amounts of variables. Instead, we calculate the CLS for each variable and obtain a measure of importance for this variable. Calculating the scores remains time-consuming for large data sets. The key idea for scaling to large data is to divide the data into smaller random batches or consecutive windows of variables. This avoids complex and time-consuming computations on high-dimensional matrices by performing the computations only for small subsets of the data, which is less costly. We compare these methods to provable approximations of CLS based on sketching, which aims at summarizing data succinctly. In a simulation study, we show that the CLSs are directly linked to the importance of a variable in the sense of an interaction effect. We further show that the approximation approaches are appropriate for performing the calculations efficiently on arbitrarily large data while preserving the interaction detection effect of the CLS. This underlines their scalability to genome wide data. In addition, we evaluate the methods on real data from the HapMap project.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":"66 8","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.70014","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biometrical Journal","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/bimj.70014","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

We develop a variable selection method for interactions in regression models on large data in the context of genetics. The method is intended for investigating the influence of single-nucleotide polymorphisms (SNPs) and their interactions on health outcomes, which is a p n $p\gg n$ problem. We introduce cross leverage scores (CLSs) to detect interactions of variables while maintaining interpretability. Using this method, it is not necessary to consider every possible interaction between variables individually, which would be very time-consuming even for moderate amounts of variables. Instead, we calculate the CLS for each variable and obtain a measure of importance for this variable. Calculating the scores remains time-consuming for large data sets. The key idea for scaling to large data is to divide the data into smaller random batches or consecutive windows of variables. This avoids complex and time-consuming computations on high-dimensional matrices by performing the computations only for small subsets of the data, which is less costly. We compare these methods to provable approximations of CLS based on sketching, which aims at summarizing data succinctly. In a simulation study, we show that the CLSs are directly linked to the importance of a variable in the sense of an interaction effect. We further show that the approximation approaches are appropriate for performing the calculations efficiently on arbitrarily large data while preserving the interaction detection effect of the CLS. This underlines their scalability to genome wide data. In addition, we evaluate the methods on real data from the HapMap project.

Abstract Image

利用交叉杠杆分数检测高维数据中的相互作用
我们开发了一种变量选择方法,用于在遗传学背景下的大数据回归模型中的相互作用。该方法旨在研究单核苷酸多态性(snp)及其相互作用对健康结果的影响,这是一个p > n$ p\gg n$的问题。我们引入交叉杠杆分数(cls)来检测变量的相互作用,同时保持可解释性。使用这种方法,不需要单独考虑变量之间的每个可能的相互作用,即使对于适量的变量,也会非常耗时。相反,我们计算每个变量的CLS,并获得该变量的重要性度量。对于大型数据集,计算分数仍然很耗时。扩展到大数据的关键思想是将数据分成更小的随机批次或连续的变量窗口。通过只对数据的小子集执行计算,这避免了在高维矩阵上进行复杂和耗时的计算,成本更低。我们将这些方法与基于草图的可证明的CLS近似进行比较,草图旨在简洁地总结数据。在模拟研究中,我们表明,在交互效应的意义上,cls与变量的重要性直接相关。我们进一步表明,近似方法适用于在任意大数据上有效地执行计算,同时保留CLS的相互作用检测效果。这强调了它们对全基因组数据的可扩展性。此外,我们还对来自HapMap项目的实际数据进行了评估。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Biometrical Journal
Biometrical Journal 生物-数学与计算生物学
CiteScore
3.20
自引率
5.90%
发文量
119
审稿时长
6-12 weeks
期刊介绍: Biometrical Journal publishes papers on statistical methods and their applications in life sciences including medicine, environmental sciences and agriculture. Methodological developments should be motivated by an interesting and relevant problem from these areas. Ideally the manuscript should include a description of the problem and a section detailing the application of the new methodology to the problem. Case studies, review articles and letters to the editors are also welcome. Papers containing only extensive mathematical theory are not suitable for publication in Biometrical Journal.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信