Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion

Impact factor 5.3 | Zone 2, Computer Science

Computational Linguistics | Pub Date: 2024-01-08 | DOI: 10.1162/coli_a_00506

Anton Thielmann, Arik Reuter, Quentin Seifert, Elisabeth Bergherr, Benjamin Säfken

Abstract

Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. Through simple corpus expansion, our model can detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human-level results on the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
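
The abstract does not spell out the proposed metrics, so the following is only a rough illustration of the general idea behind similarity-based intruder detection in a semantic space, not the authors' method. It assumes each topic word already has an embedding vector and flags as the intruder the word with the lowest mean cosine similarity to the other words. The function names `cosine` and `detect_intruder` and the two-dimensional toy vectors are hypothetical, chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def detect_intruder(words, embeddings):
    """Return the word with the lowest mean cosine similarity to the rest.

    Assumption for this sketch: `embeddings` maps every word in `words`
    to a vector, and `words` contains at least two distinct entries.
    """
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(embeddings[word], embeddings[w]) for w in others)
        return total / len(others)

    return min(words, key=mean_similarity)


# Hand-made toy vectors (hypothetical, for illustration only):
# four words point in roughly the same direction, one does not.
toy_embeddings = {
    "goal": [0.9, 0.1],
    "striker": [0.8, 0.2],
    "referee": [0.85, 0.15],
    "penalty": [0.95, 0.05],
    "inflation": [0.1, 0.9],
}
topic_words = list(toy_embeddings)

print(detect_intruder(topic_words, toy_embeddings))  # prints "inflation"
```

In practice the vectors would come from a pretrained embedding model rather than being written by hand, and agreement between such an automatic guess and human annotators' choices is the kind of quantity a correlation analysis could then measure.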

Source journal

Computational Linguistics (Computer Science, Artificial Intelligence)

Self-citation rate

0.00%

Journal description: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.