{"title":"Training 100,000 Classes on a Single Titan X in 7 Hours or 15 Minutes with 25 Titan Xs","authors":"Anshumali Shrivastava","doi":"10.1145/3184558.3193135","DOIUrl":null,"url":null,"abstract":"In this talk, I will present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers that require $O(Kd)$ memory and inference cost, MACH only need $O(dłogK)$ (d is dimensionality) memory while only requiring $O(KłogK + dłogK )$ operation for inference. MACH is a generic K-classification algorithm, with provably theoretical guarantees, without any assumption on the relationship between classes. MACH uses universal hashing to reduce classification with a large number of classes to few (logarithmic many) independent classification tasks with small (constant) number of classes. I will show the first quantification of discriminability-memory tradeoff in multi-class classification. Using the simple idea of hashing, we can train ODP dataset with 100,000 classes and 400,000 features on a single Titan X GPU, with the classification accuracy of 19.28%, which is the best-reported accuracy on this dataset. Before this work, the best performing baseline is a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with 480x reduction in the model size (of mere 0.3GB). With MACH, we also demonstrate complete training of feature extracted fine-grained imagenet dataset (compressed size 104GB), with 21,000 classes, on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training of these extreme-class datasets on a single Titan X. Furthermore, the algorithm is trivially parallelizable. Our experiments show that we can train ODP datasets in 7 hours on a single GPU or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained imagenet dataset in 24 hours on a single GPU which can be reduced to little over 1 hour with 20 GPUs.","PeriodicalId":235572,"journal":{"name":"Companion Proceedings of the The Web Conference 2018","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Proceedings of the The Web Conference 2018","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3184558.3193135","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In this talk, I will present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers that require $O(Kd)$ memory and inference cost, MACH only need $O(dłogK)$ (d is dimensionality) memory while only requiring $O(KłogK + dłogK )$ operation for inference. MACH is a generic K-classification algorithm, with provably theoretical guarantees, without any assumption on the relationship between classes. MACH uses universal hashing to reduce classification with a large number of classes to few (logarithmic many) independent classification tasks with small (constant) number of classes. I will show the first quantification of discriminability-memory tradeoff in multi-class classification. Using the simple idea of hashing, we can train ODP dataset with 100,000 classes and 400,000 features on a single Titan X GPU, with the classification accuracy of 19.28%, which is the best-reported accuracy on this dataset. Before this work, the best performing baseline is a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with 480x reduction in the model size (of mere 0.3GB). With MACH, we also demonstrate complete training of feature extracted fine-grained imagenet dataset (compressed size 104GB), with 21,000 classes, on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training of these extreme-class datasets on a single Titan X. Furthermore, the algorithm is trivially parallelizable. Our experiments show that we can train ODP datasets in 7 hours on a single GPU or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained imagenet dataset in 24 hours on a single GPU which can be reduced to little over 1 hour with 20 GPUs.