Fault Tolerant Support Vector Machines
Sameh M. Shohdy, Abhinav Vishnu, G. Agrawal
2016 45th International Conference on Parallel Processing (ICPP), August 2016
DOI: 10.1109/ICPP.2016.75 (https://doi.org/10.1109/ICPP.2016.75)
Citations: 4
Abstract
The Support Vector Machine (SVM) is a popular machine learning algorithm used for building classifiers and models. Parallel implementations of SVM, which can run on large-scale supercomputers, are becoming commonplace. However, these supercomputers -- designed under constraints of data movement -- frequently observe faults in compute devices. Many device faults manifest as permanent process/node failures. In this paper, we present several approaches for designing fault tolerant SVM algorithms. First, we present an in-depth analysis to identify the critical data structures, and build baseline algorithms that simply checkpoint these data structures periodically. Next, we propose a novel algorithm that requires no inter-node data movement for checkpointing and only O(n²/p²) recovery time -- a small fraction of the expected O(n³/p) time complexity of SVM. We implement these algorithms and evaluate them on a large-scale cluster. Our evaluation indicates that the overall data movement for checkpointing in the baseline algorithm can be up to 100x the dataset size, while the proposed novel algorithm is entirely communication-free during checkpointing. In addition, it uses up to 20x less checkpoint space, while recovering an average of 5.5x faster on 256 cores than the baseline algorithm across different numbers of checkpoints. The experiments also show that our communication-avoiding algorithm outperforms the Spark MLlib SVM implementation by an average of 6.4x with 256 cores in the presence of failures.
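The communication-free checkpointing idea summarized above can be illustrated with a minimal sketch: each worker persists only its local partition of the critical SVM state (e.g., its slice of the dual coefficients and the gradient array) to node-local storage, so no inter-node data movement is needed at checkpoint time. All names here (`LocalCheckpointer`, `save`, `restore`) are hypothetical illustrations, not the paper's actual implementation.

```python
import os
import pickle
import tempfile

class LocalCheckpointer:
    """Hypothetical per-worker checkpointer: writes only local state,
    so checkpointing involves no communication between nodes."""

    def __init__(self, directory, rank):
        # One checkpoint file per worker rank, on node-local storage.
        self.path = os.path.join(directory, f"svm_state_rank{rank}.pkl")

    def save(self, alpha, gradient, iteration):
        # Write atomically: dump to a temp file, then rename over the old
        # checkpoint so a crash mid-write cannot corrupt recovery state.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"alpha": alpha, "gradient": gradient,
                         "iteration": iteration}, f)
        os.replace(tmp, self.path)

    def restore(self):
        # On recovery, reload the local slice; the surviving state limits
        # recomputation to a small fraction of the full training cost.
        with open(self.path, "rb") as f:
            return pickle.load(f)

# Usage on one worker (rank 0): checkpoint local slices, then restore them.
ckpt = LocalCheckpointer(tempfile.gettempdir(), rank=0)
alpha = [0.0, 0.5, 0.0, 1.0]          # local slice of dual coefficients
gradient = [-1.0, -0.2, -1.0, 0.3]    # local slice of the gradient array
ckpt.save(alpha, gradient, iteration=10)
state = ckpt.restore()
```

In a real parallel SVM solver the `save` call would run every few optimization iterations, and each surviving worker would recompute only the lost partition's kernel contributions on recovery, which is what keeps recovery cost far below a full retraining.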