Training 100,000 Classes on a Single Titan X in 7 Hours or 15 Minutes with 25 Titan Xs

Companion Proceedings of the The Web Conference 2018. Pub Date: 2018-04-23. DOI: 10.1145/3184558.3193135

Anshumali Shrivastava



Abstract

In this talk, I will present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers, which require $O(Kd)$ memory and inference cost, MACH needs only $O(d \log K)$ memory (where $d$ is the dimensionality) and only $O(K \log K + d \log K)$ operations for inference. MACH is a generic K-classification algorithm with provable theoretical guarantees, and it makes no assumption about the relationship between classes. MACH uses universal hashing to reduce classification over a large number of classes to a few (logarithmically many) independent classification tasks, each with a small (constant) number of classes. I will show the first quantification of the discriminability-memory tradeoff in multi-class classification. Using the simple idea of hashing, we can train on the ODP dataset, with 100,000 classes and 400,000 features, on a single Titan X GPU and reach a classification accuracy of 19.28%, which is the best reported accuracy on this dataset. Before this work, the best-performing baseline was a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with a 480x reduction in model size (a mere 0.3 GB). With MACH, we also demonstrate complete training on the feature-extracted fine-grained ImageNet dataset (compressed size 104 GB), with 21,000 classes, on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training on these extreme-class datasets on a single Titan X. Furthermore, the algorithm is trivially parallelizable. Our experiments show that we can train on the ODP dataset in 7 hours on a single GPU or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained ImageNet dataset in 24 hours on a single GPU, which can be reduced to a little over 1 hour with 20 GPUs.
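The hashing reduction described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several choices are assumptions made for illustration:

- The hash family is `((a*c + b) mod P) mod B`.
- The aggregation rule is a plain average of bucket probabilities, which the abstract does not spell out.
- The names `make_hashes`, `relabel`, `aggregate`, and `predict` are hypothetical.
- The toy sizes `K`, `B`, and `R` are arbitrary.
- Parameters are taken to be 4-byte floats.

Training the R small classifiers is omitted. Perfectly confident one-hot outputs stand in for them.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1, larger than any class id used below


def make_hashes(R, B, seed=0):
    """Draw R independent hash functions mapping class ids to B buckets.

    Assumed family: h(c) = ((a*c + b) mod P) mod B with random a, b.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(R)]
    return [lambda c, a=a, b=b: ((a * c + b) % P) % B for a, b in params]


def relabel(labels, h):
    """Labels for one small task: each class id is replaced by its bucket."""
    return [h(y) for y in labels]


def aggregate(bucket_probs, hashes, K):
    """Score each class by averaging, over the R tasks, the predicted
    probability of the bucket that class hashes to (assumed rule)."""
    R = len(hashes)
    return [sum(p[h(c)] for p, h in zip(bucket_probs, hashes)) / R
            for c in range(K)]


def predict(bucket_probs, hashes, K):
    """Return the class id with the highest aggregated score."""
    scores = aggregate(bucket_probs, hashes, K)
    return max(range(K), key=scores.__getitem__)


# Toy sizes (illustrative only): K classes, B buckets, R repetitions.
K, B, R = 1000, 16, 8
hashes = make_hashes(R, B)
true_class = 123

# Stand-in for R trained B-way classifiers that are perfectly confident:
# task j puts all probability on the bucket of the true class.
bucket_probs = [[1.0 if b == h(true_class) else 0.0 for b in range(B)]
                for h in hashes]

scores = aggregate(bucket_probs, hashes, K)
print(scores[true_class])                # the true class matches in every task
print(predict(bucket_probs, hashes, K))  # a wrong class wins only if it collides in all R tasks

# Parameter counts for linear models, using the ODP sizes from the abstract.
K_odp, d_odp = 100_000, 400_000
ova_params = K_odp * d_odp          # one weight vector of length d per class
ova_gb = ova_params * 4 / 1e9       # assuming 4-byte floats
mach_params_toy = R * B * d_odp     # R small models with B outputs each (toy R, B)
print(ova_params, ova_gb, mach_params_toy)
```

Under the 4-byte assumption, the one-vs-all count reproduces the 40 billion parameters and 160 GB quoted in the abstract, and 160 GB divided by 480 is about 0.33 GB, in line with the stated 0.3 GB. Because the R tasks share no parameters, each one can be trained on a separate GPU, which matches the abstract's remark that the algorithm is trivially parallelizable.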
