Scalable inference in max-margin topic models

Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang
{"title":"Scalable inference in max-margin topic models","authors":"Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang","doi":"10.1145/2487575.2487658","DOIUrl":null,"url":null,"abstract":"Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. By marrying with advanced learning techniques, the predictive strengths of STMs have been dramatically enhanced, such as max-margin supervised topic models, state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs have a hard non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple ``augment-and-collapse\" Gibbs sampling algorithm without making restricting assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Though performing the two tasks of topic discovery and learning predictive models jointly, which significantly improves the classification performance, our methods have comparable scalability as the state-of-the-art parallel algorithms for the standard LDA topic models which perform the single task of topic discovery only. Finally, an open-source implementation is also provided at: http://www.ml-thu.net/~jun/medlda.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2487575.2487658","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. Marrying topic models with advanced learning techniques has dramatically enhanced the predictive strength of STMs; max-margin supervised topic models, which integrate max-margin learning with topic models, are state-of-the-art examples. Though powerful, max-margin STMs pose a hard, non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow for large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple "augment-and-collapse" Gibbs sampling algorithm that makes no restrictive assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Although our methods jointly perform the two tasks of topic discovery and learning predictive models, which significantly improves classification performance, they scale comparably to the state-of-the-art parallel algorithms for standard LDA topic models, which perform only the single task of topic discovery. Finally, an open-source implementation is provided at: http://www.ml-thu.net/~jun/medlda.
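To make the "augment-and-collapse" idea concrete, below is a minimal single-machine NumPy sketch of one Gibbs sweep for the binary-classification case, based on the scale-mixture (data-augmentation) representation of the hinge loss that the abstract alludes to: each document gets an augmentation variable lambda_d whose reciprocal has an inverse-Gaussian conditional, the classifier weights eta then have a closed-form Gaussian conditional, and topic assignments are resampled with the usual collapsed-LDA term multiplied by a discriminant factor. The function name, hyperparameter defaults, and the 1e-8 jitter are illustrative assumptions, not the authors' reference implementation; the paper's open-source code at http://www.ml-thu.net/~jun/medlda implements the full multi-class/multi-label, parallel algorithm.

```python
import numpy as np

def gibbs_medlda_sweep(docs, y, z, n_dk, n_kw, n_k, eta,
                       alpha=0.1, beta=0.01, c=1.0, ell=1.0, nu2=1.0,
                       rng=None):
    """One 'augment-and-collapse' Gibbs sweep for binary Gibbs MedLDA (sketch).

    docs : list of D int arrays (word ids); y : (D,) labels in {-1, +1};
    z    : list of D int arrays (topic assignments, same shapes as docs);
    n_dk : (D, K) doc-topic counts; n_kw : (K, V) topic-word counts;
    n_k  : (K,) topic totals; eta : (K,) classifier weights (updated in place).
    """
    rng = rng or np.random.default_rng()
    D, K = n_dk.shape
    V = n_kw.shape[1]
    N = np.array([len(d) for d in docs], dtype=float)  # document lengths
    zbar = n_dk / N[:, None]                           # empirical topic proportions

    # 1) Augmentation variables: lambda_d^{-1} ~ InvGaussian(1/(c|zeta_d|), 1),
    #    where zeta_d = ell - y_d * eta . zbar_d is the margin residual.
    zeta = ell - y * (zbar @ eta)
    lam = 1.0 / rng.wald(1.0 / (c * np.abs(zeta) + 1e-8), 1.0)

    # 2) Classifier weights from their Gaussian conditional (no SVM solver):
    #    Sigma^{-1} = I/nu2 + c^2 sum_d zbar_d zbar_d^T / lambda_d
    #    mu         = Sigma . c sum_d y_d (lambda_d + c*ell)/lambda_d * zbar_d
    prec = np.eye(K) / nu2 + c**2 * (zbar / lam[:, None]).T @ zbar
    cov = np.linalg.inv(prec)
    mu = cov @ (c * ((y * (lam + c * ell) / lam)[:, None] * zbar).sum(axis=0))
    eta[:] = rng.multivariate_normal(mu, cov)

    # 3) Topic assignments: collapsed-LDA term times the discriminant factor
    #    exp(-c*zeta - c^2*zeta^2 / (2*lambda_d)).
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k_old = z[d][n]
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            lda = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            s = eta @ n_dk[d] / N[d]            # eta . zbar_d without token n
            zt = ell - y[d] * (s + eta / N[d])  # zeta_d if token n took topic k
            logw = -c * zt - c**2 * zt**2 / (2.0 * lam[d])
            p = lda * np.exp(logw - logw.max())
            k_new = rng.choice(K, p=p / p.sum())
            z[d][n] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1
```

Note how step 2 replaces the latent SVM subproblems of earlier max-margin STM solvers with a single multivariate-Gaussian draw, and how, given lambda and eta, step 3 has the same per-document structure as collapsed Gibbs sampling for standard LDA; this structural similarity is consistent with the abstract's claim of scalability comparable to parallel LDA samplers.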