Topic Modeling of Short Texts Using Anchor Words

Florian Steuber, Mirco Schönfeld, G. Rodosek
{"title":"Topic Modeling of Short Texts Using Anchor Words","authors":"Florian Steuber, Mirco Schönfeld, G. Rodosek","doi":"10.1145/3405962.3405968","DOIUrl":null,"url":null,"abstract":"We present Archetypal LDA or short A-LDA, a topic model tailored to short texts containing \"semantic anchors\" which convey a certain meaning or implicitly build on discussions beyond their mere presence. A-LDA is an extension to Latent Dirichlet Allocation in that we guide the process of topic inference by these semantic anchors as seed words to the LDA. We identify these seed words unsupervised from the documents and evaluate their co-occurrences using archetypal analysis, a geometric approximation problem that aims for finding k points that best approximate the data set's convex hull. These so called archetypes are considered as latent topics and used to guide the LDA. We demonstrate the effectiveness of our approach using Twitter, where semantic anchor words are the hashtags assigned to tweets by users. In direct comparison to LDA, A-LDA achieves 10-13% better results. We find that representing topics in terms of hashtags corresponding to calculated archetypes alone already results in interpretable topics and the model's performance peaks for seed confidence values ranging from 0.7 to 0.9.","PeriodicalId":247414,"journal":{"name":"Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3405962.3405968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

We present Archetypal LDA, or A-LDA for short, a topic model tailored to short texts containing "semantic anchors" that convey a certain meaning or implicitly build on discussions beyond their mere presence. A-LDA extends Latent Dirichlet Allocation by guiding topic inference with these semantic anchors, which serve as seed words for the LDA. We identify these seed words from the documents in an unsupervised manner and evaluate their co-occurrences using archetypal analysis, a geometric approximation problem that seeks k points best approximating the convex hull of the data set. These so-called archetypes are treated as latent topics and used to guide the LDA. We demonstrate the effectiveness of our approach on Twitter, where the semantic anchor words are the hashtags assigned to tweets by users. In a direct comparison with LDA, A-LDA achieves 10-13% better results. We find that representing topics solely by the hashtags corresponding to the computed archetypes already yields interpretable topics, and that the model's performance peaks for seed confidence values between 0.7 and 0.9.
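
The abstract only names the ingredients of the pipeline. As a rough illustration of how they fit together, the sketch below builds a hashtag co-occurrence matrix, runs a minimal archetypal analysis (solved here with a simple projected-gradient scheme, not necessarily the solver the authors used), and reads off each archetype's top hashtags as candidate seed words. All function names, the toy data, and the optimisation details are assumptions made for illustration; the seeded-LDA step itself is only indicated at the end.

```python
import numpy as np
from itertools import combinations


def hashtag_cooccurrence(tweets_hashtags, vocab):
    """Build a symmetric hashtag-by-hashtag co-occurrence count matrix."""
    index = {h: i for i, h in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for tags in tweets_hashtags:
        for a, b in combinations(sorted(set(tags) & set(vocab)), 2):
            C[index[a], index[b]] += 1
            C[index[b], index[a]] += 1
    return C


def project_rows_to_simplex(V):
    """Euclidean projection of every row of V onto the probability simplex
    (Duchi et al., 2008)."""
    n, d = V.shape
    U = -np.sort(-V, axis=1)                     # each row sorted descending
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, d + 1)
    rho = (U - css / idx > 0).sum(axis=1)        # last index meeting the condition
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)


def archetypal_analysis(X, k, iters=2000, seed=0):
    """Minimise ||X - A @ B @ X||^2 with the rows of A (n x k) and B (k x n)
    constrained to the simplex; the k rows of B @ X are the archetypes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = project_rows_to_simplex(rng.random((n, k)))
    B = project_rows_to_simplex(rng.random((k, n)))
    # conservative step size derived from a crude Lipschitz bound
    lr = 0.5 / (max(n, k) * np.linalg.norm(X, 2) ** 2 + 1e-12)
    for _ in range(iters):
        R = X - A @ (B @ X)                      # residual for the A-update
        A = project_rows_to_simplex(A + lr * R @ (B @ X).T)
        R = X - A @ (B @ X)                      # residual for the B-update
        B = project_rows_to_simplex(B + lr * (A.T @ R) @ X.T)
    return A, B @ X


# Toy usage: six tweets, two obvious hashtag communities (hypothetical data).
tweets_hashtags = [["#climate", "#energy"], ["#energy", "#policy"],
                   ["#climate", "#policy"], ["#football", "#goal"],
                   ["#goal", "#league"], ["#football", "#league"]]
vocab = sorted({t for tags in tweets_hashtags for t in tags})
X = hashtag_cooccurrence(tweets_hashtags, vocab)

A, archetypes = archetypal_analysis(X, k=2)
for z in archetypes:
    # The highest-weighted hashtags of each archetype become the seed words
    # that would then guide a seeded/guided LDA run.
    print("archetype seeds:", [vocab[i] for i in np.argsort(-z)[:3]])
```

In a full pipeline, these per-archetype seed lists would be handed to a seeded or guided LDA implementation together with a seed confidence value; the abstract reports that performance peaks for seed confidences between 0.7 and 0.9.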