Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections

Rob Churchill, Lisa Singh
DOI: 10.1109/ICDM51629.2021.00017
Venue: 2021 IEEE International Conference on Data Mining (ICDM)
Publication date: 2021-12-01
Citations: 4

Abstract

Most topic models define a document as a mixture of topics and each topic as a mixture of words. Generally, the difference in generative topic models is how these mixtures of topics are generated. We propose looking at topic models in a new way, as topic-noise models. Our topic-noise model defines a document as a mixture of topics and noise. Topic Noise Discriminator (TND) estimates both the topic and noise distributions using not only the relationships between words in documents, but also the linguistic relationships found using word embeddings. This type of model is important for short, sparse social media posts that contain both random and non-random noise. We also understand that topic quality is subjective and that researchers may have preferences. Therefore, we propose a variant of our model that combines the pre-trained noise distribution from TND in an ensemble with any generative topic model to filter noise words and produce more coherent and diverse topic sets. We present this approach using Latent Dirichlet Allocation (LDA) and show that it is effective for maintaining high quality LDA topics while removing noise within them. Finally, we show the value of using a context-specific noise list generated from TND to remove noise statically, after topics have been generated by any topic model, including non-generative ones. We demonstrate the effectiveness of all three of these approaches that explicitly model context-specific noise in document collections.
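The last of the three approaches — statically filtering a context-specific noise list (as produced by TND) out of topics that any model has already generated — can be sketched minimally. The topic word lists and noise words below are hypothetical placeholders for illustration, not the paper's actual data or implementation:

```python
def filter_topics(topics, noise_words, top_k=10):
    """Statically remove noise words from each topic's ranked word list,
    keeping the top_k remaining words. A sketch of post-hoc noise-list
    filtering; works with topics from any model, generative or not."""
    noise = set(noise_words)
    return [[w for w in words if w not in noise][:top_k] for words in topics]

# Hypothetical topics from some topic model, and a noise list
# of the kind TND might produce for a social media corpus.
topics = [["vote", "rt", "election", "https", "ballot", "amp"],
          ["covid", "rt", "vaccine", "lol", "mask"]]
noise_words = ["rt", "https", "amp", "lol"]

print(filter_topics(topics, noise_words, top_k=3))
# [['vote', 'election', 'ballot'], ['covid', 'vaccine', 'mask']]
```

Because the noise list is applied after topic generation, this step is independent of the topic model's internals, which is what makes it usable even with non-generative models.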