Training 100,000 Classes on a Single Titan X in 7 Hours or 15 Minutes with 25 Titan Xs

Anshumali Shrivastava
{"title":"Training 100,000 Classes on a Single Titan X in 7 Hours or 15 Minutes with 25 Titan Xs","authors":"Anshumali Shrivastava","doi":"10.1145/3184558.3193135","DOIUrl":null,"url":null,"abstract":"In this talk, I will present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers that require $O(Kd)$ memory and inference cost, MACH only need $O(dłogK)$ (d is dimensionality) memory while only requiring $O(KłogK + dłogK )$ operation for inference. MACH is a generic K-classification algorithm, with provably theoretical guarantees, without any assumption on the relationship between classes. MACH uses universal hashing to reduce classification with a large number of classes to few (logarithmic many) independent classification tasks with small (constant) number of classes. I will show the first quantification of discriminability-memory tradeoff in multi-class classification. Using the simple idea of hashing, we can train ODP dataset with 100,000 classes and 400,000 features on a single Titan X GPU, with the classification accuracy of 19.28%, which is the best-reported accuracy on this dataset. Before this work, the best performing baseline is a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with 480x reduction in the model size (of mere 0.3GB). With MACH, we also demonstrate complete training of feature extracted fine-grained imagenet dataset (compressed size 104GB), with 21,000 classes, on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training of these extreme-class datasets on a single Titan X. Furthermore, the algorithm is trivially parallelizable. Our experiments show that we can train ODP datasets in 7 hours on a single GPU or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained imagenet dataset in 24 hours on a single GPU which can be reduced to little over 1 hour with 20 GPUs.","PeriodicalId":235572,"journal":{"name":"Companion Proceedings of the The Web Conference 2018","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Proceedings of the The Web Conference 2018","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3184558.3193135","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In this talk, I will present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers that require $O(Kd)$ memory and inference cost, MACH needs only $O(d\log K)$ memory (d is the dimensionality) and only $O(K\log K + d\log K)$ operations for inference. MACH is a generic K-classification algorithm with provable theoretical guarantees and makes no assumption about the relationship between classes. MACH uses universal hashing to reduce classification over a large number of classes to a few (logarithmically many) independent classification tasks, each over a small (constant) number of classes. I will show the first quantification of the discriminability-memory tradeoff in multi-class classification. Using this simple hashing idea, we can train the ODP dataset, with 100,000 classes and 400,000 features, on a single Titan X GPU, achieving a classification accuracy of 19.28%, the best-reported accuracy on this dataset. Before this work, the best-performing baseline was a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with a 480x reduction in model size (a mere 0.3 GB). With MACH, we also demonstrate complete training of a feature-extracted fine-grained ImageNet dataset (compressed size 104 GB) with 21,000 classes on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training of these extreme-class datasets on a single Titan X. Furthermore, the algorithm is trivially parallelizable. Our experiments show that we can train the ODP dataset in 7 hours on a single GPU, or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained ImageNet dataset in 24 hours on a single GPU, which can be reduced to little over 1 hour with 20 GPUs.
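
The hashing reduction described above can be illustrated with a short sketch: hash the K class labels into a small number of buckets using R independent universal hash functions, train one small B-class classifier per hash function, and average the bucket probabilities across the R models at inference. The code below is an illustrative reconstruction of that idea, not the authors' implementation; the toy Gaussian data and the settings K=1000, d=50, R=8, B=16 are assumptions chosen only to keep it self-contained and runnable.

```python
# Minimal MACH-style sketch (assumed toy setup, not the paper's code):
# reduce a K-class problem to R independent B-class problems via
# 2-universal hashing, then merge bucket probabilities at inference.
import numpy as np
from sklearn.linear_model import LogisticRegression

K, d, R, B = 1000, 50, 8, 16     # classes, features, hash repetitions, buckets
rng = np.random.default_rng(0)
P = 2_147_483_647                # large prime for 2-universal hashing

# R independent hash functions h_r(y) = ((a_r * y + b_r) mod P) mod B
a = rng.integers(1, P, size=R)
b = rng.integers(0, P, size=R)
def h(r, y):
    return ((a[r] * np.asarray(y, dtype=np.int64) + b[r]) % P) % B

# Toy data: each class is a noisy copy of a random class mean.
means = rng.normal(size=(K, d))
y_train = rng.integers(0, K, size=20_000)
X_train = means[y_train] + 0.1 * rng.normal(size=(len(y_train), d))

# Train R small B-class classifiers on hashed labels: O(R*B*d) weights
# instead of the O(K*d) weights of a one-vs-all classifier.
# (With K >> B and this many samples, every bucket appears in training.)
models = [LogisticRegression(max_iter=200).fit(X_train, h(r, y_train))
          for r in range(R)]

def predict(X):
    # Merge step: a class's score is the average probability of its
    # bucket across the R independent meta-classifiers.
    scores = np.zeros((len(X), K))
    for r, clf in enumerate(models):
        proba = clf.predict_proba(X)       # shape (n, B)
        buckets = h(r, np.arange(K))       # bucket id of every class
        scores += proba[:, buckets]        # gather per-class bucket scores
    return scores.argmax(axis=1)

y_test = rng.integers(0, K, size=2_000)
X_test = means[y_test] + 0.1 * rng.normal(size=(len(y_test), d))
print("accuracy:", (predict(X_test) == y_test).mean())
```

In this sketch the model stores R*B*d weights rather than K*d, mirroring the memory comparison in the abstract; under the paper's guarantees, R and B grow only logarithmically in K, which is what yields the $O(d\log K)$ memory footprint.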