Fault Tolerant Support Vector Machines
Sameh M. Shohdy, Abhinav Vishnu, G. Agrawal
2016 45th International Conference on Parallel Processing (ICPP), August 2016
DOI: 10.1109/ICPP.2016.75 (https://doi.org/10.1109/ICPP.2016.75)
Citations: 4
Abstract
The Support Vector Machine (SVM) is a popular machine learning algorithm used for building classifiers and models. Parallel implementations of SVM, which can run on large-scale supercomputers, are becoming commonplace. However, these supercomputers -- designed under constraints of data movement -- frequently observe faults in compute devices. Many device faults manifest as permanent process/node failures. In this paper, we present several approaches for designing fault tolerant SVM algorithms. First, we present an in-depth analysis to identify the critical data structures, and build baseline algorithms that simply checkpoint these data structures periodically. Next, we propose a novel algorithm that requires no inter-node data movement for checkpointing and only O(n²/p²) recovery time -- a small fraction of the expected O(n³/p) time complexity of SVM. We implement these algorithms and evaluate them on a large-scale cluster. Our evaluation indicates that the overall data movement for checkpointing in the baseline algorithm can be up to 100x the dataset size, while the proposed novel algorithm is entirely communication-free during checkpointing. In addition, it uses up to 20x less checkpoint space, while recovering an average of 5.5x faster on 256 cores than the baseline algorithm across different numbers of checkpoints. The experiments also show that our communication-avoiding algorithm outperforms the Spark MLlib SVM implementation by an average of 6.4x with 256 cores in the presence of failures.
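The communication-free checkpointing idea summarized above can be illustrated with a minimal sketch: each worker persists only its local partition of the critical SVM state (e.g., its slice of the dual coefficients and the gradient array) to node-local storage, so no inter-node data movement is needed at checkpoint time. All names here (`LocalCheckpointer`, `save`, `restore`) are hypothetical illustrations, not the paper's actual implementation.

```python
import os
import pickle
import tempfile

class LocalCheckpointer:
    """Hypothetical per-worker checkpointer: writes only local state,
    so checkpointing involves no communication between nodes."""

    def __init__(self, directory, rank):
        # One checkpoint file per worker rank, on node-local storage.
        self.path = os.path.join(directory, f"svm_state_rank{rank}.pkl")

    def save(self, alpha, gradient, iteration):
        # Write atomically: dump to a temp file, then rename over the old
        # checkpoint so a crash mid-write cannot corrupt recovery state.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"alpha": alpha, "gradient": gradient,
                         "iteration": iteration}, f)
        os.replace(tmp, self.path)

    def restore(self):
        # On recovery, reload the local slice; the surviving state limits
        # recomputation to a small fraction of the full training cost.
        with open(self.path, "rb") as f:
            return pickle.load(f)

# Usage on one worker (rank 0): checkpoint local slices, then restore them.
ckpt = LocalCheckpointer(tempfile.gettempdir(), rank=0)
alpha = [0.0, 0.5, 0.0, 1.0]          # local slice of dual coefficients
gradient = [-1.0, -0.2, -1.0, 0.3]    # local slice of the gradient array
ckpt.save(alpha, gradient, iteration=10)
state = ckpt.restore()
```

In a real parallel SVM solver the `save` call would run every few optimization iterations, and each surviving worker would recompute only the lost partition's kernel contributions on recovery, which is what keeps recovery cost far below a full retraining.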