Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments

Ole-Christian Galbo Engstrøm
arXiv:2401.13185 · arXiv - CS - Mathematical Software · Published 2024-01-24
Citations: 0

Abstract

Cross-validation is a widely used technique for assessing the performance of predictive models on unseen data. Many predictive models, such as Kernel-Based Partial Least-Squares (PLS) models, require the computation of $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ using only training set samples from the input and output matrices, $\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three algorithms that efficiently compute these matrices. The first allows no column-wise preprocessing. The second allows column-wise centering around the training set means. The third allows column-wise centering and column-wise scaling around the training set means and standard deviations. Demonstrating correctness and superior computational complexity, they offer significant cross-validation speedup compared with straightforward cross-validation and previous work on fast cross-validation, all without data leakage. Their suitability for parallelization is highlighted with an open-source Python implementation combining our algorithms with Improved Kernel PLS.
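The core idea described in the abstract can be sketched as follows: compute the global products and column sums once, then for each fold subtract the validation rows' contribution and apply centering/scaling corrections derived from the training-set moments, avoiding a full recomputation per fold. The sketch below is our reconstruction from the abstract, not the paper's actual API; all function and variable names are hypothetical, and for brevity only $\mathbf{X}$ is scaled while $\mathbf{Y}$ is centered.

```python
import numpy as np

def training_products(XtX, XtY, x_sum, y_sum, x_sqsum, n,
                      X_val, Y_val, center=True, scale=True):
    """Hypothetical sketch of the fold-wise shortcut.

    XtX, XtY      -- global X^T X and X^T Y, computed once over all n samples
    x_sum, y_sum  -- global column sums of X and Y
    x_sqsum       -- global column sums of X**2
    X_val, Y_val  -- the rows held out for validation in this fold
    """
    n_t = n - X_val.shape[0]
    # Training-set products: subtract the validation rows' contribution.
    XtX_t = XtX - X_val.T @ X_val
    XtY_t = XtY - X_val.T @ Y_val
    if not center and not scale:
        return XtX_t, XtY_t
    # Training-set column means, also by subtraction.
    mu_x = (x_sum - X_val.sum(axis=0)) / n_t
    mu_y = (y_sum - Y_val.sum(axis=0)) / n_t
    # Centering identity: (X - 1*mu^T)^T (X - 1*mu^T) = X^T X - n_t * mu mu^T
    XtX_c = XtX_t - n_t * np.outer(mu_x, mu_x)
    XtY_c = XtY_t - n_t * np.outer(mu_x, mu_y)
    if not scale:
        return XtX_c, XtY_c
    # Training-set sample standard deviations from running sums of squares:
    # sum((x - mu)^2) = sum(x^2) - n_t * mu^2, divided by (n_t - 1).
    sq = x_sqsum - (X_val ** 2).sum(axis=0)
    sd = np.sqrt((sq - n_t * mu_x ** 2) / (n_t - 1))
    # Scaling X's columns by sd divides entry (j, k) of X^T X by sd_j * sd_k
    # and row j of X^T Y by sd_j.
    return XtX_c / np.outer(sd, sd), XtY_c / sd[:, None]
```

The per-fold cost is then driven by the size of the held-out fold rather than the full training set, which is where the claimed speedup over straightforward cross-validation comes from.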