Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

Yang You, J. Demmel, Cho-Jui Hsieh, R. Vuduc
{"title":"在并行和分布式系统上精确、快速和可扩展的核岭回归","authors":"Yang You, J. Demmel, Cho-Jui Hsieh, R. Vuduc","doi":"10.1145/3205289.3205290","DOIUrl":null,"url":null,"abstract":"Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n2) memory to form an n-by-n kernel matrix and Θ(n3) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR requires 2 TB memory just to store the kernel matrix. The reason is that n usually is much larger than d for real-world applications. On the other hand, weak scaling becomes a problem: if we keep d and n/p fixed as p grows (p is # machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p2) per processor. In the perfect weak scaling situation, both the memory needed and the flops grow as Θ(1) per processor (i.e. memory and flops are constant). The traditional Distributed KRR implementation (DKRR) only achieved 0.32% weak scaling efficiency from 96 to 1536 processors. We propose two new methods to address these problems: the Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup for getting the same accuracy by using the same data and the same hardware (1536 processors). BKRR2 (optimized version of BKRR) achieves a higher accuracy than the current fastest method using less training time for a variety of datasets. For the applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves 3505x speedup (theoretical speedup: 4096x).","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems\",\"authors\":\"Yang You, J. Demmel, Cho-Jui Hsieh, R. Vuduc\",\"doi\":\"10.1145/3205289.3205290\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n2) memory to form an n-by-n kernel matrix and Θ(n3) flops to compute the final model. These time and storage costs prohibit KRR from scaling up to large datasets. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR requires 2 TB memory just to store the kernel matrix. The reason is that n usually is much larger than d for real-world applications. On the other hand, weak scaling becomes a problem: if we keep d and n/p fixed as p grows (p is # machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p2) per processor. In the perfect weak scaling situation, both the memory needed and the flops grow as Θ(1) per processor (i.e. memory and flops are constant). 
The traditional Distributed KRR implementation (DKRR) only achieved 0.32% weak scaling efficiency from 96 to 1536 processors. We propose two new methods to address these problems: the Balanced KRR (BKRR) and K-means KRR (KKRR). These methods consider alternative ways to partition the input dataset into p different parts, generating p different models, and then selecting the best model among them. Compared to a conventional implementation, KKRR2 (optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup for getting the same accuracy by using the same data and the same hardware (1536 processors). BKRR2 (optimized version of BKRR) achieves a higher accuracy than the current fastest method using less training time for a variety of datasets. For the applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves 3505x speedup (theoretical speedup: 4096x).\",\"PeriodicalId\":441217,\"journal\":{\"name\":\"Proceedings of the 2018 International Conference on Supercomputing\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3205289.3205290\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3205289.3205290","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 26

Abstract

Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form an n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prevent KRR from scaling up to large datasets. For example, even for a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR needs 2 TB of memory just to store the kernel matrix, because n is usually much larger than d in real-world applications. Weak scaling is also a problem: if we keep d and n/p fixed as p grows (where p is the number of machines), the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor. In the ideal weak-scaling situation, both memory and flops grow as Θ(1) per processor, i.e., they stay constant. The traditional distributed KRR implementation (DKRR) achieved only 0.32% weak-scaling efficiency when scaling from 96 to 1536 processors.

We propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). Both partition the input dataset into p parts, generate p different models, and then select the best model among them. Compared to a conventional implementation, KKRR2 (an optimized version of KKRR) improves weak-scaling efficiency from 0.32% to 38% and achieves a 591x speedup for reaching the same accuracy on the same data and hardware (1536 processors). BKRR2 (an optimized version of BKRR) achieves higher accuracy than the current fastest method while using less training time across a variety of datasets. For applications that require only approximate solutions, BKRR2 improves weak-scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).
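To make these costs concrete, here is a minimal single-node NumPy sketch of a traditional KRR implementation. The RBF kernel and the direct dense solve are assumptions on our part; the abstract does not specify the kernel or the solver. The sketch shows where the Θ(n²) memory and Θ(n³) flops come from:

```python
import numpy as np

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Fit KRR with an RBF kernel: solve (K + lam*I) alpha = y."""
    n = X.shape[0]
    # The input itself is small: 520,000 x 90 doubles is about 374 MB
    # (357 MiB), but the n-by-n kernel matrix below needs
    # 520,000^2 * 8 bytes ~ 2.16e12 bytes ~ 2 TB: Theta(n^2) memory.
    sq = np.sum(X * X, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)))
    # The dense solve of the regularized n-by-n system is Theta(n^3) flops.
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(X_train, X_test, alpha, gamma=1.0):
    """Predict with the fitted dual coefficients alpha."""
    sq_tr = np.sum(X_train * X_train, axis=1)
    sq_te = np.sum(X_test * X_test, axis=1)
    K_te = np.exp(-gamma * (sq_te[:, None] + sq_tr[None, :]
                            - 2.0 * (X_test @ X_train.T)))
    return K_te @ alpha
```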
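The per-processor weak-scaling costs quoted in the abstract follow directly from dividing the total costs by p. With m = n/p and d held fixed while p grows, n = p·m, so:

```latex
% Per-processor costs of DKRR under weak scaling (n = pm, m fixed):
\[
\frac{n^2}{p} = \frac{(pm)^2}{p} = m^2\, p = \Theta(p)
\quad\text{(memory per processor)},
\qquad
\frac{n^3}{p} = \frac{(pm)^3}{p} = m^3\, p^2 = \Theta(p^2)
\quad\text{(flops per processor)}.
\]
```

Doubling the machine count therefore doubles the memory and quadruples the flops each processor must handle, which is why DKRR's weak-scaling efficiency collapses.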
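The partition-and-select scheme described in the abstract can be sketched as follows, reusing krr_fit and krr_predict from the block above. The random balanced split here is only a stand-in: per their names, BKRR builds balanced partitions and KKRR partitions with k-means, and the optimized KKRR2/BKRR2 variants add further refinements not described in the abstract:

```python
def partition_train_select(X, y, X_val, y_val, p, lam=1e-3, gamma=1.0):
    """Split the data into p parts, fit one KRR model per part, and
    keep the model with the lowest validation error."""
    rng = np.random.default_rng(0)
    parts = np.array_split(rng.permutation(len(X)), p)
    best_mse, best_model = np.inf, None
    for idx in parts:
        # Each local model forms only an (n/p)-by-(n/p) kernel matrix,
        # cutting memory to Theta((n/p)^2) and flops to Theta((n/p)^3)
        # per part -- the source of the improved weak scaling.
        alpha = krr_fit(X[idx], y[idx], lam, gamma)
        pred = krr_predict(X[idx], X_val, alpha, gamma)
        mse = float(np.mean((pred - y_val) ** 2))
        if mse < best_mse:
            best_mse, best_model = mse, (idx, alpha)
    return best_model, best_mse
```

Since the p local fits are independent, they map naturally onto p machines, with only the scalar validation errors reduced at the end.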