A Multilevel Clustering Model for Coherent Topic Discovery in Short Texts

2022 IST-Africa Conference (IST-Africa). Pub Date: 2020-05-22. DOI: 10.23919/IST-Africa56635.2022.9845648

Emmanuel Maithya, L. Nderu, D. Njagi

Citations: 0

Abstract

Deducing meaning from collections of documents has become an increasingly important task for decision makers in both industry and academia. To address this challenge, topic modelling techniques have been developed to identify and isolate the words that most closely summarise the contents of a document collection. However, the topics these techniques extract from collections of short texts achieve low coherence scores, which defeats the purpose for which the techniques were created. In this paper, we propose the n-gram_cluster model, which exploits the semantic closeness between n-grams and word clusters formed from collections of the n-grams at different levels to discover topics. The model is able to discover semantically coherent topics from collections of short texts. We evaluated our model against three other conventional models and show that it forms topics that achieve comparatively higher coherence scores.
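The abstract does not say how n-grams are extracted or which coherence measure is used, so the following is only an illustrative sketch and not the authors' n-gram_cluster implementation. It assumes whitespace-tokenised short texts and a UMass-style document co-occurrence coherence. The function names `ngrams` and `umass_coherence` and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def umass_coherence(topic_words, documents):
    """UMass-style coherence (an assumed measure, not taken from the paper).

    Sums log((D(w_i, w_j) + 1) / D(w_i)) over ordered word pairs, where
    D counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(topic_words)), 2):
        w_i, w_j = topic_words[i], topic_words[j]  # w_i is ranked before w_j
        denom = doc_freq(w_i)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + 1) / denom)
    return score


# Toy corpus of short texts (hypothetical data, for illustration only).
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "refund"],
    ["cheap", "hotel", "deal"],
    ["football", "match", "score"],
]

bigrams = ngrams(docs[0], 2)
coherent = umass_coherence(["cheap", "deal"], docs)
incoherent = umass_coherence(["cheap", "football"], docs)
```

On this toy corpus, words that co-occur in the same short texts ("cheap", "deal") score higher than words that never appear together ("cheap", "football"), which is the sense in which a higher coherence score indicates a more interpretable topic.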

Source journal

2022 IST-Africa Conference (IST-Africa)

Self-citation rate

0.00%

Number of publications