{"title":"交叉验证捷径:在不重新计算矩阵乘积或统计矩的情况下,高效地得出列居中和缩放的训练集 $\\mathbf{X}^\\mathbf{T}\\mathbf{X}$ 和 $\\mathbf{X}^\\mathbf{T}\\mathbf{Y}$","authors":"Ole-Christian Galbo Engstrøm","doi":"arxiv-2401.13185","DOIUrl":null,"url":null,"abstract":"Cross-validation is a widely used technique for assessing the performance of\npredictive models on unseen data. Many predictive models, such as Kernel-Based\nPartial Least-Squares (PLS) models, require the computation of\n$\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$\nusing only training set samples from the input and output matrices,\n$\\mathbf{X}$ and $\\mathbf{Y}$, respectively. In this work, we present three\nalgorithms that efficiently compute these matrices. The first one allows no\ncolumn-wise preprocessing. The second one allows column-wise centering around\nthe training set means. The third one allows column-wise centering and\ncolumn-wise scaling around the training set means and standard deviations.\nDemonstrating correctness and superior computational complexity, they offer\nsignificant cross-validation speedup compared with straight-forward\ncross-validation and previous work on fast cross-validation - all without data\nleakage. Their suitability for parallelization is highlighted with an\nopen-source Python implementation combining our algorithms with Improved Kernel\nPLS.","PeriodicalId":501256,"journal":{"name":"arXiv - CS - Mathematical Software","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\\\\mathbf{X}^\\\\mathbf{T}\\\\mathbf{X}$ and $\\\\mathbf{X}^\\\\mathbf{T}\\\\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments\",\"authors\":\"Ole-Christian Galbo Engstrøm\",\"doi\":\"arxiv-2401.13185\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-validation is a widely used technique for assessing the performance of\\npredictive models on unseen data. Many predictive models, such as Kernel-Based\\nPartial Least-Squares (PLS) models, require the computation of\\n$\\\\mathbf{X}^{\\\\mathbf{T}}\\\\mathbf{X}$ and $\\\\mathbf{X}^{\\\\mathbf{T}}\\\\mathbf{Y}$\\nusing only training set samples from the input and output matrices,\\n$\\\\mathbf{X}$ and $\\\\mathbf{Y}$, respectively. In this work, we present three\\nalgorithms that efficiently compute these matrices. The first one allows no\\ncolumn-wise preprocessing. The second one allows column-wise centering around\\nthe training set means. The third one allows column-wise centering and\\ncolumn-wise scaling around the training set means and standard deviations.\\nDemonstrating correctness and superior computational complexity, they offer\\nsignificant cross-validation speedup compared with straight-forward\\ncross-validation and previous work on fast cross-validation - all without data\\nleakage. 
Their suitability for parallelization is highlighted with an\\nopen-source Python implementation combining our algorithms with Improved Kernel\\nPLS.\",\"PeriodicalId\":501256,\"journal\":{\"name\":\"arXiv - CS - Mathematical Software\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Mathematical Software\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2401.13185\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Mathematical Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.13185","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Shortcutting Cross-Validation: Efficiently Deriving Column-Wise Centered and Scaled Training Set $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$ Without Full Recomputation of Matrix Products or Statistical Moments
Cross-validation is a widely used technique for assessing the performance of
predictive models on unseen data. Many predictive models, such as Kernel-Based
Partial Least-Squares (PLS) models, require the computation of
$\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$
using only training set samples from the input and output matrices,
$\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three
algorithms that efficiently compute these matrices. The first handles the case of
no column-wise preprocessing. The second allows column-wise centering around
the training set means. The third allows column-wise centering and
column-wise scaling around the training set means and standard deviations.
We demonstrate their correctness and superior computational complexity; they offer
a significant cross-validation speedup compared with straightforward
cross-validation and with previous work on fast cross-validation, all without data
leakage. Their suitability for parallelization is highlighted with an
open-source Python implementation combining our algorithms with Improved Kernel
PLS.
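
The abstract leaves the actual update formulas implicit. The sketch below is a minimal NumPy reconstruction of the kind of shortcut it describes: precompute the full-data $\mathbf{X}^\mathbf{T}\mathbf{X}$, $\mathbf{X}^\mathbf{T}\mathbf{Y}$ and column sums once, then obtain each fold's training-set products by subtracting the validation rows' contribution and applying closed-form centering and scaling identities. It is illustrative only and not the paper's open-source implementation; the function name is invented, and the use of biased (ddof = 0) standard deviations and the scaling of Y as well as X are assumptions of this sketch.

```python
import numpy as np


def training_xtx_xty(X, Y, val_idx, center=True, scale=True):
    """Training-set X^T X and X^T Y for the fold holding out the rows in val_idx.

    X is (N, p), Y is (N, m); val_idx is an index array of validation rows.
    """
    N = X.shape[0]
    # Full-data products and column statistics. In a real cross-validation loop
    # these would be computed once and reused for every fold; they are inlined
    # here only to keep the sketch self-contained.
    XtX_full = X.T @ X
    XtY_full = X.T @ Y
    x_sum, y_sum = X.sum(axis=0), Y.sum(axis=0)
    y_sumsq = (Y ** 2).sum(axis=0)

    X_val, Y_val = X[val_idx], Y[val_idx]
    n_train = N - len(val_idx)

    # First case (no preprocessing): subtract the validation rows' contribution.
    XtX = XtX_full - X_val.T @ X_val
    XtY = XtY_full - X_val.T @ Y_val
    if not center and not scale:
        return XtX, XtY

    # Training-set column means from full sums minus validation sums (no leakage).
    mu_x = (x_sum - X_val.sum(axis=0)) / n_train
    mu_y = (y_sum - Y_val.sum(axis=0)) / n_train

    # Second case (centering): (X - 1 mu_x)^T (X - 1 mu_x) = X^T X - n mu_x mu_x^T,
    # and analogously for X^T Y, evaluated on the training rows only.
    XtX_c = XtX - n_train * np.outer(mu_x, mu_x)
    XtY_c = XtY - n_train * np.outer(mu_x, mu_y)
    if not scale:
        return XtX_c, XtY_c

    # Third case (centering + scaling): divide by training-set standard deviations.
    # Biased (ddof = 0) standard deviations are assumed; assumes no constant columns.
    sd_x = np.sqrt(np.diag(XtX_c) / n_train)
    var_y = (y_sumsq - (Y_val ** 2).sum(axis=0)) / n_train - mu_y ** 2
    sd_y = np.sqrt(var_y)
    XtX_cs = XtX_c / np.outer(sd_x, sd_x)
    XtY_cs = XtY_c / np.outer(sd_x, sd_y)
    return XtX_cs, XtY_cs
```

Under these assumptions, the full-data products and column sums are computed once and shared across folds, so the per-fold work scales with the size of the validation block rather than with the whole training set, which is consistent with the speedup the abstract claims.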