Discussion of the paper 'A review of distributed statistical inference'

IF 1.3 Q3 STATISTICS & PROBABILITY

Statistical Theory and Related Fields. Pub Date: 2021-12-28. DOI: 10.1080/24754269.2021.2017544

Heng Lian

Volume 6, Issue 1, pp. 100-101



Discussion of the paper ‘A review of distributed statistical inference’

The authors should be congratulated on their timely and comprehensive review of this emerging field, which will certainly attract more researchers to the area. In the simplest one-shot approach, the entire dataset is distributed across multiple machines, each machine computes a local estimate based on its local data only, and a central machine aggregates the local estimates as a final processing step. In more complicated settings, multiple rounds of communication are carried out, typically also passing first-order information (gradients) and/or second-order information (Hessian matrices) between the local machines and the central machine. The review clearly separates the existing work in this area into several sections, covering parametric regression, nonparametric regression, and other models including principal component analysis and variable screening. In this discussion, I will consider some possible future directions for this area, based on my own experience.

The first problem is the combination of divide-and-conquer estimation with efficient local algorithms not used in traditional statistical analysis. The motivation is that, because of the stringent constraint on the number of machines that can be used either in practice or in theory (for example, with a one-shot approach the number of machines that can be used is O(√N)), the sample size on each worker machine can still be large. In other words, even after partitioning, the local sample size may still be too large to be processed by traditional algorithms. In such a case, a more efficient algorithm (possibly one that only approximates the exact solution) should be used on each local machine. The important question is whether the optimal statistical properties can be retained with such an algorithm. One such attempt with an affirmative answer was recently reported in Lian et al. (2021).
In this work, we use random sketches (random projections) for kernel regression in an RKHS framework for nonparametric regression. The use of random sketches reduces the computational complexity on each worker machine while still retaining the optimal statistical convergence rate. We expect combinations along this direction to be useful in various settings, with different settings calling for different efficient algorithms to compute approximate solutions.

The second problem is to extend the studies beyond the worker-server model. Most existing methods in the statistics literature focus on the centralized system, in which a single special machine communicates with all others and coordinates computation and communication. However, in many modern applications such systems are rare and unreliable, since the failure of the central machine would be disastrous. Statistical inference in a decentralized system, synchronous or asynchronous, in which there is no such specialized central machine, would be an interesting research direction for statisticians. Currently, decentralized systems are investigated from a purely optimization point of view, without incorporating statistical properties (Ram et al., 2010; Yuan et al., 2016).

Finally, on the theoretical side, the distributed statistical inference problem provides opportunities and challenges for investigating the fundamental limits (i.e., lower bounds) on achievable performance, taking into account communication, computational, and statistical trade-offs. For example, in various models, if a one-shot approach is used, there is a limit on the number of machines allowed in the system, and using more machines leads to a suboptimal statistical convergence rate. On the other hand, when multiple rounds of communication are allowed, the constraint on the number of machines can be relaxed or even removed. This represents a trade-off between communication and statistical accuracy.
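The limit on the number of machines in a one-shot approach can be seen in a toy example that is not taken from the paper under discussion. Suppose the target is μ² for data with mean μ and variance σ², each of m machines reports the square of its local sample mean, and the central machine averages these reports. With N the total sample size and n = N/m observations per machine, each local estimate has bias σ²/n = σ²m/N, which averaging does not remove, while the standard error of the full-sample estimator is of order 1/√N. The bias is therefore negligible only when m grows more slowly than √N. The sketch below checks this by simulation; the function names `one_shot_estimate` and `simulate_bias` are hypothetical, and Gaussian data with N divisible by m are assumed.

```python
import numpy as np

def one_shot_estimate(x, m):
    """One-shot estimate of mu^2: average of squared local sample means.

    Assumes len(x) is divisible by m, so every machine holds n = N / m points.
    """
    local_means = x.reshape(m, -1).mean(axis=1)  # one sample mean per machine
    return np.mean(local_means ** 2)             # central machine averages local estimates

def simulate_bias(N, m, mu=1.0, sigma=1.0, reps=200, seed=0):
    """Monte Carlo estimate of the bias of the one-shot estimator of mu^2."""
    rng = np.random.default_rng(seed)
    estimates = [
        one_shot_estimate(rng.normal(mu, sigma, size=N), m) for _ in range(reps)
    ]
    return np.mean(estimates) - mu ** 2

if __name__ == "__main__":
    N = 10_000
    for m in (1, 10, 100, 1000):
        # Theory for this toy model: bias = sigma^2 * m / N
        print(f"m = {m:5d}  simulated bias = {simulate_bias(N, m):.4f}  "
              f"theory = {m / N:.4f}")
```

In this toy model the limit disappears if the machines communicate their local means instead, since the central machine can then square the pooled mean, which is the full-sample estimator. In less trivial models, relaxing the limit typically needs further rounds of communication, as the text notes.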
As another example, the computational and statistical trade-off has already been explored in many works (Khetan & Oh, 2018; L. Wang et al., 2019; T. Wang et al., 2016). The question is how this would change when communication comes into play. A general framework that takes into account computational, statistical, and communication costs is called for, and would significantly advance the understanding of distributed estimation and inference.
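To make the first direction above concrete, here is a minimal sketch of kernel ridge regression with a Gaussian random sketch, as it might run on a single worker machine. It is a generic construction and is not claimed to be the estimator of Lian et al. (2021); the kernel, the bandwidth, the sketch size `s`, the penalty `lam`, and all function names are assumptions made for illustration. The coefficient vector is restricted to the form alpha = S^T beta, with S an s x n random matrix, so the linear system to solve is s x s instead of n x n.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=0.2):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def sketched_krr_fit(X, y, s, lam, rng):
    """Sketched kernel ridge regression on one machine.

    Returns alpha = S^T beta, where beta minimizes
    (1/n) * ||y - K S^T beta||^2 + lam * beta^T (S K S^T) beta.
    """
    n = X.shape[0]
    K = rbf_kernel(X, X)                      # n x n kernel matrix
    S = rng.normal(size=(s, n)) / np.sqrt(s)  # Gaussian sketch (an assumed choice)
    KS = K @ S.T                              # n x s
    A = KS.T @ KS + n * lam * (S @ KS)        # s x s matrix of the normal equations
    b = KS.T @ y
    # lstsq rather than solve, because A can be numerically rank-deficient
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S.T @ beta

def krr_predict(X_train, alpha, X_new):
    """Evaluate the fitted function at the rows of X_new."""
    return rbf_kernel(X_new, X_train) @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)
    alpha = sketched_krr_fit(X, y, s=30, lam=1e-3, rng=rng)
    grid = np.linspace(0, 1, 50)[:, None]
    truth = np.sin(2 * np.pi * grid[:, 0])
    mse = np.mean((krr_predict(X, alpha, grid) - truth) ** 2)
    print("MSE against the true function:", mse)
```

This version still forms the full kernel matrix, so the saving is in the solve step only. In a divide-and-conquer setting each worker would run such a fit on its own partition and the central machine would average the fitted functions.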

