Evaluation of the Dirichlet Process Multinomial Mixture Model for Short-Text Topic Modeling

Alexander Karlsson, Denio Duarte, G. Mathiason, Juhee Bae
{"title":"Dirichlet过程多项混合模型在短文本主题建模中的评价","authors":"Alexander Karlsson, Denio Duarte, G. Mathiason, Juhee Bae","doi":"10.1109/ISCBI.2018.00025","DOIUrl":null,"url":null,"abstract":"Fast-moving trends, both in society and in highly competitive business areas, call for effective methods for automatic analysis. The availability of fast-moving sources in the form of short texts, such as social media and blogs, allows aggregation from a vast number of text sources, for an up to date view of trends and business insights. Topic modeling is established as an approach for analysis of large amounts of texts, but the scarcity of statistical information in short texts is considered to be a major problem for obtaining reliable topics from traditional models such as LDA. A range of different specialized topic models have been proposed, but a majority of these approaches rely on rather strong parametric assumptions, such as setting a fixed number of topics. In contrast, recent advances in the field of Bayesian non-parametrics suggest the Dirichlet process as a method that, given certain hyper-parameters, can self-adapt to the number of topics of the data at hand. We perform an empirical evaluation of the Dirichlet process multinomial (unigram) mixture model against several parametric topic models, initialized with different number of topics. The resulting models are evaluated, using both direct and indirect measures that have been found to correlate well with human topic rankings. We show that the Dirichlet Process Multinomial Mixture model is a viable option for short text topic modeling since it on average performs better, or nearly as good, compared to the parametric alternatives, while reducing parameter setting requirements and thereby eliminates the need of expensive preprocessing.","PeriodicalId":153800,"journal":{"name":"2018 6th International Symposium on Computational and Business Intelligence (ISCBI)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Evaluation of the Dirichlet Process Multinomial Mixture Model for Short-Text Topic Modeling\",\"authors\":\"Alexander Karlsson, Denio Duarte, G. Mathiason, Juhee Bae\",\"doi\":\"10.1109/ISCBI.2018.00025\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fast-moving trends, both in society and in highly competitive business areas, call for effective methods for automatic analysis. The availability of fast-moving sources in the form of short texts, such as social media and blogs, allows aggregation from a vast number of text sources, for an up to date view of trends and business insights. Topic modeling is established as an approach for analysis of large amounts of texts, but the scarcity of statistical information in short texts is considered to be a major problem for obtaining reliable topics from traditional models such as LDA. A range of different specialized topic models have been proposed, but a majority of these approaches rely on rather strong parametric assumptions, such as setting a fixed number of topics. In contrast, recent advances in the field of Bayesian non-parametrics suggest the Dirichlet process as a method that, given certain hyper-parameters, can self-adapt to the number of topics of the data at hand. 
We perform an empirical evaluation of the Dirichlet process multinomial (unigram) mixture model against several parametric topic models, initialized with different number of topics. The resulting models are evaluated, using both direct and indirect measures that have been found to correlate well with human topic rankings. We show that the Dirichlet Process Multinomial Mixture model is a viable option for short text topic modeling since it on average performs better, or nearly as good, compared to the parametric alternatives, while reducing parameter setting requirements and thereby eliminates the need of expensive preprocessing.\",\"PeriodicalId\":153800,\"journal\":{\"name\":\"2018 6th International Symposium on Computational and Business Intelligence (ISCBI)\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 6th International Symposium on Computational and Business Intelligence (ISCBI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISCBI.2018.00025\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 6th International Symposium on Computational and Business Intelligence (ISCBI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCBI.2018.00025","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Fast-moving trends, both in society and in highly competitive business areas, call for effective methods for automatic analysis. The availability of fast-moving sources in the form of short texts, such as social media and blogs, allows aggregation from a vast number of text sources for an up-to-date view of trends and business insights. Topic modeling is established as an approach for analyzing large amounts of text, but the scarcity of statistical information in short texts is considered a major problem for obtaining reliable topics from traditional models such as LDA. A range of specialized topic models have been proposed, but a majority of these approaches rely on rather strong parametric assumptions, such as setting a fixed number of topics. In contrast, recent advances in the field of Bayesian non-parametrics suggest the Dirichlet process as a method that, given certain hyper-parameters, can self-adapt to the number of topics in the data at hand. We perform an empirical evaluation of the Dirichlet process multinomial (unigram) mixture model against several parametric topic models initialized with different numbers of topics. The resulting models are evaluated using both direct and indirect measures that have been found to correlate well with human topic rankings. We show that the Dirichlet process multinomial mixture model is a viable option for short-text topic modeling, since on average it performs better than, or nearly as well as, the parametric alternatives, while reducing parameter-setting requirements and thereby eliminating the need for expensive preprocessing.
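
For a concrete picture of the kind of model under evaluation, the sketch below implements a collapsed Gibbs sampler for a Dirichlet-process multinomial mixture over short texts: each document is assigned to a single topic, a Chinese Restaurant Process prior with concentration alpha governs how many topics are in use, and each topic carries a multinomial word distribution with a symmetric Dirichlet prior beta. This is an illustrative sketch under these assumptions, not the authors' implementation; the function name dpmm_gibbs, the default hyper-parameter values, and the plain-Python data structures are chosen purely for exposition.

```python
import math
import random
from collections import defaultdict


def dpmm_gibbs(docs, vocab_size, alpha=1.0, beta=0.1, n_iters=30, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-process multinomial mixture.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns the cluster (topic) id assigned to each document.
    """
    rng = random.Random(seed)
    z = [-1] * len(docs)                                   # cluster id per document
    cluster_docs = defaultdict(int)                        # cluster -> #documents
    cluster_words = defaultdict(lambda: defaultdict(int))  # cluster -> word -> count
    cluster_total = defaultdict(int)                       # cluster -> total word count
    next_id = 0                                            # id for a brand-new cluster

    def log_pred(doc, k):
        """Log predictive probability of doc under cluster k (k may be empty)."""
        counts, total = cluster_words[k], cluster_total[k]
        seen = defaultdict(int)
        lp = 0.0
        for i, w in enumerate(doc):
            lp += math.log(counts[w] + beta + seen[w])
            lp -= math.log(total + vocab_size * beta + i)
            seen[w] += 1
        return lp

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            k_old = z[d]
            if k_old != -1:                                # take the document out
                cluster_docs[k_old] -= 1
                cluster_total[k_old] -= len(doc)
                for w in doc:
                    cluster_words[k_old][w] -= 1
                if cluster_docs[k_old] == 0:               # drop empty clusters
                    del cluster_docs[k_old]
                    del cluster_words[k_old]
                    del cluster_total[k_old]

            # Chinese Restaurant Process prior x Dirichlet-multinomial predictive
            cands, logps = [], []
            for k, m_k in cluster_docs.items():            # existing clusters
                cands.append(k)
                logps.append(math.log(m_k) + log_pred(doc, k))
            cands.append(next_id)                          # option: open a new cluster
            logps.append(math.log(alpha) + log_pred(doc, next_id))

            # sample a cluster proportionally to exp(logp), stabilised by the max
            mx = max(logps)
            ws = [math.exp(lp - mx) for lp in logps]
            r = rng.random() * sum(ws)
            k_new, acc = cands[-1], 0.0
            for k, w_ in zip(cands, ws):
                acc += w_
                if r <= acc:
                    k_new = k
                    break

            z[d] = k_new                                   # put the document back
            cluster_docs[k_new] += 1
            cluster_total[k_new] += len(doc)
            for w in doc:
                cluster_words[k_new][w] += 1
            if k_new == next_id:
                next_id += 1
    return z
```

In this formulation the number of topics is never fixed in advance: any document can open a new cluster with probability proportional to alpha, and clusters that lose all their documents are discarded, which is the self-adaptation property the abstract refers to.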