On collocations and topic models

Jey Han Lau, Timothy Baldwin, D. Newman
{"title":"On collocations and topic models","authors":"Jey Han Lau, Timothy Baldwin, D. Newman","doi":"10.1145/2483969.2483972","DOIUrl":null,"url":null,"abstract":"We investigate the impact of preextracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternate measure that penalizes model complexity. We show how the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in improved topic quality over unigram tokenization. Further increases in topic quality can be achieved by using up to 10,000 bigrams, but this is at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n-gram tokenization on LDA topic models, and the first work to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.","PeriodicalId":412532,"journal":{"name":"ACM Trans. Speech Lang. Process.","volume":"343 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"63","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Trans. Speech Lang. Process.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2483969.2483972","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 63

Abstract

We investigate the impact of pre-extracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternative measure that penalizes model complexity. We show that the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in improved topic quality over unigram tokenization. Further increases in topic quality can be achieved by using up to 10,000 bigrams, but this is at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n-gram tokenization on LDA topic models, and the first work to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.
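The pre-extraction step described in the abstract can be approximated with standard collocation tooling. Below is a minimal sketch using NLTK's bigram collocation finder with a log-likelihood-ratio ranking; the paper's exact association measure, frequency threshold, and top-n cutoff are assumptions here, not taken from the source. The idea is to replace each occurrence of a top-ranked bigram with a single underscore-joined token before building the document-term representation.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def merge_top_bigrams(tokens, n=1000):
    """Rewrite a token stream so the top-n bigram collocations
    become single underscore-joined tokens (e.g. 'new_york')."""
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(5)  # assumed threshold: ignore rare bigrams
    top = set(finder.nbest(BigramAssocMeasures.likelihood_ratio, n))
    out, i = [], 0
    while i < len(tokens) - 1:
        if (tokens[i], tokens[i + 1]) in top:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # consume both words of the merged bigram
        else:
            out.append(tokens[i])
            i += 1
    if i == len(tokens) - 1:  # trailing token not part of a bigram
        out.append(tokens[-1])
    return out
```

The complexity-penalized comparison rests on the Akaike information criterion, AIC = 2k - 2 ln L, where L is the maximized likelihood and k the number of free parameters. Adding bigrams to the vocabulary grows k, so a bigram model is preferred only if its likelihood gain outweighs the larger parameter count. A minimal sketch follows, assuming the parameter count is dominated by the topic-word distributions (one common accounting for LDA, not necessarily the paper's exact formulation):

```python
def lda_aic(log_likelihood: float, num_topics: int, vocab_size: int) -> float:
    """AIC for a fitted LDA model: 2k - 2*log_likelihood.
    Assumes k = num_topics * (vocab_size - 1), i.e. one multinomial
    over the vocabulary per topic; lower AIC indicates a better fit
    after penalizing model complexity."""
    k = num_topics * (vocab_size - 1)
    return 2 * k - 2 * log_likelihood
```

For example, comparing lda_aic(ll_unigram, T, W) against lda_aic(ll_bigram, T, W + 1000) makes the trade-off explicit: the bigram-augmented model wins only if its likelihood improvement exceeds the penalty for the enlarged vocabulary, which is the sense in which a modest number of top-ranked bigrams is "optimal" in the abstract.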