{"title":"交叉验证捷径:在不重新计算矩阵乘积或统计矩的情况下,高效地得出列居中和缩放的训练集 $\\mathbf{X}^\\mathbf{T}\\mathbf{X}$ 和 $\\mathbf{X}^\\mathbf{T}\\mathbf{Y}$","authors":"Ole-Christian Galbo Engstrøm","doi":"arxiv-2401.13185","DOIUrl":null,"url":null,"abstract":"Cross-validation is a widely used technique for assessing the performance of\npredictive models on unseen data. Many predictive models, such as Kernel-Based\nPartial Least-Squares (PLS) models, require the computation of\n$\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$\nusing only training set samples from the input and output matrices,\n$\\mathbf{X}$ and $\\mathbf{Y}$, respectively. In this work, we present three\nalgorithms that efficiently compute these matrices. The first one allows no\ncolumn-wise preprocessing. The second one allows column-wise centering around\nthe training set means. The third one allows column-wise centering and\ncolumn-wise scaling around the training set means and standard deviations.\nDemonstrating correctness and superior computational complexity, they offer\nsignificant cross-validation speedup compared with straight-forward\ncross-validation and previous work on fast cross-validation - all without data\nleakage. Their suitability for parallelization is highlighted with an\nopen-source Python implementation combining our algorithms with Improved Kernel\nPLS.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\\\\mathbf{X}^\\\\mathbf{T}\\\\mathbf{X}$ and $\\\\mathbf{X}^\\\\mathbf{T}\\\\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments\",\"authors\":\"Ole-Christian Galbo Engstrøm\",\"doi\":\"arxiv-2401.13185\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-validation is a widely used technique for assessing the performance of\\npredictive models on unseen data. Many predictive models, such as Kernel-Based\\nPartial Least-Squares (PLS) models, require the computation of\\n$\\\\mathbf{X}^{\\\\mathbf{T}}\\\\mathbf{X}$ and $\\\\mathbf{X}^{\\\\mathbf{T}}\\\\mathbf{Y}$\\nusing only training set samples from the input and output matrices,\\n$\\\\mathbf{X}$ and $\\\\mathbf{Y}$, respectively. In this work, we present three\\nalgorithms that efficiently compute these matrices. The first one allows no\\ncolumn-wise preprocessing. The second one allows column-wise centering around\\nthe training set means. The third one allows column-wise centering and\\ncolumn-wise scaling around the training set means and standard deviations.\\nDemonstrating correctness and superior computational complexity, they offer\\nsignificant cross-validation speedup compared with straight-forward\\ncross-validation and previous work on fast cross-validation - all without data\\nleakage. 
Their suitability for parallelization is highlighted with an\\nopen-source Python implementation combining our algorithms with Improved Kernel\\nPLS.\",\"PeriodicalId\":501256,\"journal\":{\"name\":\"arXiv - CS - Mathematical Software\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Mathematical Software\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2401.13185\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.13185","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments
Cross-validation is a widely used technique for assessing the performance of
predictive models on unseen data. Many predictive models, such as Kernel-Based
Partial Least-Squares (PLS) models, require the computation of
$\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$
using only training set samples from the input and output matrices,
$\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three
algorithms that efficiently compute these matrices. The first handles the case of
no column-wise preprocessing. The second allows column-wise centering around
the training set means. The third allows column-wise centering and
column-wise scaling around the training set means and standard deviations.
We demonstrate their correctness and superior computational complexity; they offer
a significant cross-validation speedup compared with straightforward
cross-validation and with previous work on fast cross-validation, all without data
leakage. Their suitability for parallelization is highlighted with an
open-source Python implementation combining our algorithms with Improved Kernel
PLS.
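
The abstract leaves the actual update formulas implicit. The sketch below is a minimal NumPy reconstruction of the kind of shortcut it describes: precompute the full-data $\mathbf{X}^\mathbf{T}\mathbf{X}$, $\mathbf{X}^\mathbf{T}\mathbf{Y}$ and column sums once, then obtain each fold's training-set products by subtracting the validation rows' contribution and applying closed-form centering and scaling identities. It is illustrative only and not the paper's open-source implementation; the function name is invented, and the use of biased (ddof = 0) standard deviations and the scaling of Y as well as X are assumptions of this sketch.

```python
import numpy as np


def training_xtx_xty(X, Y, val_idx, center=True, scale=True):
    """Training-set X^T X and X^T Y for the fold holding out the rows in val_idx.

    X is (N, p), Y is (N, m); val_idx is an index array of validation rows.
    """
    N = X.shape[0]
    # Full-data products and column statistics. In a real cross-validation loop
    # these would be computed once and reused for every fold; they are inlined
    # here only to keep the sketch self-contained.
    XtX_full = X.T @ X
    XtY_full = X.T @ Y
    x_sum, y_sum = X.sum(axis=0), Y.sum(axis=0)
    y_sumsq = (Y ** 2).sum(axis=0)

    X_val, Y_val = X[val_idx], Y[val_idx]
    n_train = N - len(val_idx)

    # First case (no preprocessing): subtract the validation rows' contribution.
    XtX = XtX_full - X_val.T @ X_val
    XtY = XtY_full - X_val.T @ Y_val
    if not center and not scale:
        return XtX, XtY

    # Training-set column means from full sums minus validation sums (no leakage).
    mu_x = (x_sum - X_val.sum(axis=0)) / n_train
    mu_y = (y_sum - Y_val.sum(axis=0)) / n_train

    # Second case (centering): (X - 1 mu_x)^T (X - 1 mu_x) = X^T X - n mu_x mu_x^T,
    # and analogously for X^T Y, evaluated on the training rows only.
    XtX_c = XtX - n_train * np.outer(mu_x, mu_x)
    XtY_c = XtY - n_train * np.outer(mu_x, mu_y)
    if not scale:
        return XtX_c, XtY_c

    # Third case (centering + scaling): divide by training-set standard deviations.
    # Biased (ddof = 0) standard deviations are assumed; assumes no constant columns.
    sd_x = np.sqrt(np.diag(XtX_c) / n_train)
    var_y = (y_sumsq - (Y_val ** 2).sum(axis=0)) / n_train - mu_y ** 2
    sd_y = np.sqrt(var_y)
    XtX_cs = XtX_c / np.outer(sd_x, sd_x)
    XtY_cs = XtY_c / np.outer(sd_x, sd_y)
    return XtX_cs, XtY_cs
```

Under these assumptions, the full-data products and column sums are computed once and shared across folds, so the per-fold work scales with the size of the validation block rather than with the whole training set, which is consistent with the speedup the abstract claims.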