C-DIEGO: An Algorithm with Near-Optimal Sample Complexity for Distributed, Streaming PCA

2023 57th Annual Conference on Information Sciences and Systems (CISS). Pub Date: 2023-03-22. DOI: 10.1109/CISS56502.2023.10089668

Muhammad Zulqarnain, Arpita Gang, W. Bajwa


Citations: 0


C-DIEGO: An Algorithm with Near-Optimal Sample Complexity for Distributed, Streaming PCA

The accuracy of many downstream machine learning algorithms depends on the training data having uncorrelated features. Because modern data is often streaming, geographically distributed, and high-dimensional, it is paramount to apply both uncorrelated feature learning and dimensionality reduction in this setting. Principal Component Analysis (PCA) is a state-of-the-art tool that simultaneously yields uncorrelated features and reduces data dimensions by projecting data onto the eigenvectors of the population covariance matrix. This paper introduces Consensus-DIstributEd Generalized Oja (C-DIEGO), a novel algorithm based on Oja's method that estimates the dominant eigenvector of a population covariance matrix in a distributed, streaming setting. The algorithm considers a network of arbitrarily connected nodes without a central coordinator and assumes that data samples arrive continuously at the individual nodes in a streaming manner. The paper establishes that C-DIEGO can achieve an order-optimal convergence rate if the nodes in the network are allowed enough consensus rounds per algorithmic iteration. Numerical results showcasing the efficacy of the proposed algorithm are also reported.
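The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes: an Oja-style update at each node combined with a fixed number of consensus-averaging rounds per iteration. It should not be read as the paper's exact C-DIEGO procedure. The ring topology, the mixing matrix `W`, the constant step size, the shared initialization, and all function names (`ring_mixing_matrix`, `consensus_oja`, `demo`) are assumptions made for illustration.

```python
import numpy as np


def ring_mixing_matrix(n_nodes):
    """Doubly stochastic weights for a ring network (assumed topology).

    Each node averages its own value with those of its two ring neighbors.
    """
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i, (i - 1) % n_nodes, (i + 1) % n_nodes):
            W[i, j] += 1.0 / 3.0
    return W


def consensus_oja(sample_fn, W, dim, n_iters, n_consensus, step, rng):
    """Oja-style estimate of the dominant eigenvector with consensus averaging.

    sample_fn(t) must return an array of shape (n_nodes, dim): one fresh
    sample per node at iteration t. Returns one unit-norm estimate per node.
    """
    n_nodes = W.shape[0]
    v0 = rng.standard_normal(dim)
    v0 /= np.linalg.norm(v0)
    V = np.tile(v0, (n_nodes, 1))  # assumption: all nodes share one initial vector

    for t in range(1, n_iters + 1):
        X = sample_fn(t)
        # Local Oja direction at each node: x * (x^T v).
        G = X * np.sum(X * V, axis=1, keepdims=True)
        # Consensus rounds: repeated neighbor averaging drives every row of G
        # toward the network-wide mean of the local directions.
        for _ in range(n_consensus):
            G = W @ G
        # Oja update followed by renormalization to unit length.
        V = V + step * G
        V /= np.linalg.norm(V, axis=1, keepdims=True)
    return V


def demo(seed=0):
    """Run the sketch on synthetic Gaussian data with a known dominant eigenvector."""
    rng = np.random.default_rng(seed)
    n_nodes, dim = 6, 5
    eigvals = np.array([5.0, 1.0, 0.5, 0.3, 0.1])  # covariance = diag(eigvals)

    def sample_fn(t):
        # Zero-mean samples whose population covariance is diag(eigvals),
        # so the dominant eigenvector is the first coordinate axis.
        return rng.standard_normal((n_nodes, dim)) * np.sqrt(eigvals)

    W = ring_mixing_matrix(n_nodes)
    V = consensus_oja(sample_fn, W, dim, n_iters=2000,
                      n_consensus=10, step=0.01, rng=rng)
    # Absolute cosine between each node's estimate and the true eigenvector.
    return np.abs(V[:, 0])


if __name__ == "__main__":
    print(demo())
```

In this sketch, `n_consensus` plays the role of the number of consensus rounds per algorithmic iteration mentioned in the abstract: with more rounds, each node's averaged direction is closer to the network-wide mean, at the cost of more communication.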
