Minimum cost-compression risk in principal component analysis

Pub Date: 2022-12-28 | DOI: 10.1111/anzs.12378
Bhargab Chattopadhyay, Swarnali Banerjee
Citations: 0

Abstract


Principal Component Analysis (PCA) is a popular multivariate analytic tool that can be used for dimension reduction without losing much information. Data vectors containing a large number of features and arriving sequentially may be correlated with each other; an effective algorithm for such situations is online PCA. Existing online PCA research revolves around proposing efficient, scalable updating algorithms that focus on compression loss alone. It does not take into account the dataset size at which the arrival of further data vectors can be terminated and dimension reduction applied. It is well known that dataset size affects compression loss: the smaller the dataset, the larger the compression loss, and the larger the dataset, the smaller the compression loss. However, reducing compression loss by increasing the dataset size raises the total data collection cost. In this paper, we move beyond the scalability and updating problems related to online PCA and focus on optimising a cost-compression loss that accounts for both the compression loss and the data collection cost. We minimise the corresponding risk using a two-stage PCA algorithm. The resulting two-stage algorithm is a fast and efficient alternative to online PCA and is shown to exhibit attractive convergence properties with no assumptions on specific data distributions. Experimental studies demonstrate similar results, and further illustrations are provided using real data. As an extension, a multi-stage PCA algorithm is discussed as well. Given the time complexity, the two-stage PCA algorithm is emphasised over the multi-stage PCA algorithm for online data.
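The trade-off the abstract describes, where more data lowers compression loss but raises collection cost, can be sketched in a few lines. This is a minimal illustration, not the paper's actual procedure: the risk form `loss/n + c*n`, its minimiser `n* = sqrt(loss/c)`, and the `draw_more` helper are all assumptions made here for concreteness.

```python
import numpy as np

def compression_loss(X, k):
    """PCA compression loss: the variance discarded when keeping k of d
    components, i.e. the sum of the d - k smallest eigenvalues of the
    sample covariance matrix."""
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
    return eigvals[: X.shape[1] - k].sum()

def two_stage_pca_size(pilot, k, cost_per_vector, draw_more):
    """Illustrative two-stage scheme: a pilot sample estimates the
    compression loss, that estimate fixes a final sample size balancing
    loss against collection cost, and the remaining vectors are drawn
    in a single second stage.

    Hypothetically assumes a risk of the form loss/n + c*n, whose
    minimiser is n* = sqrt(loss / c).  `draw_more(m)` is a hypothetical
    callable yielding m further data vectors.
    """
    n0 = len(pilot)
    loss = compression_loss(pilot, k)
    n_star = int(np.ceil(np.sqrt(loss / cost_per_vector)))
    n_final = max(n_star, n0)            # never discard the pilot
    extra = draw_more(n_final - n0)      # second-stage collection
    return np.vstack([pilot, extra]) if len(extra) else pilot
```

After the second stage, ordinary PCA is run once on the fixed-size dataset, which is what makes such a scheme cheaper than the per-vector eigen-updates of online PCA.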
