A topic detection method based on KM-LSH Fusion algorithm and improved BTM model

IF 3.1 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Soft Computing Pub Date : 2024-08-07 DOI:10.1007/s00500-024-09874-x

Wenjun Liu, Huan Guo, Jiaxin Gan, Hai Wang, Hailan Wang, Chao Zhang, Qingcheng Peng, Yuyan Sun, Bao Yu, Mengshu Hou, Bo Li, Xiaolei Li


Citations: 0

Abstract


Topic detection is an information processing technology designed to help people cope with the growing volume of information on the Internet. In the research literature, topic detection methods classify topics using word embeddings and supervised or unsupervised approaches. However, most of these methods address only the clustering problem and do not address the loss of detection accuracy associated with the cohesiveness of topics. In addition, the sequence of biterms during topic detection can cause substantial deviations in the detected topic content. To solve these problems, this paper proposes a topic detection method based on a KM-LSH fusion algorithm and an improved BTM model. The KM-LSH fusion algorithm combines K-means clustering with LSH refinement clustering. The proposed method addresses the cohesiveness problem in topic detection, and the improved BTM model mitigates the influence of biterm sequence on topic detection. First, text vectors are constructed by applying text preprocessing methods to the collected set of microblog texts. Second, the KM-LSH fusion algorithm is used to calculate text similarity and to perform topic clustering and refinement. Finally, the improved BTM model is used to model the texts, combined with word position and an improved TF-IDF weight calculation algorithm to adjust the microblog texts within the clusters. The experimental results indicate that the proposed KM-LSH-IBTM method improves the evaluation metrics compared with three other topic detection methods. In conclusion, the proposed KM-LSH-IBTM method improves topic detection with respect to cohesiveness and biterm sequence.
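The abstract gives no implementation details, so the following is only a minimal sketch of two generic building blocks it names: extracting biterms from a short text, and refining a coarse cluster with locality-sensitive hashing. It assumes sign-random-projection LSH and order-insensitive biterms; neither choice is confirmed as the authors' method. The names `biterms`, `make_hyperplanes`, `lsh_signature`, and `refine_cluster` are hypothetical, and the K-means step that would produce the coarse clusters is omitted.

```python
import random
from collections import defaultdict
from itertools import combinations


def biterms(tokens):
    """Return the unordered word pairs (biterms) of one short text."""
    # Sorting each pair makes ("b", "a") and ("a", "b") the same biterm,
    # so the result does not depend on the order of the words in the text.
    # Assumption: this is one simple way to remove order sensitivity.
    return sorted({tuple(sorted(pair)) for pair in combinations(tokens, 2)
                   if pair[0] != pair[1]})


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (sign-random-projection LSH)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def refine_cluster(vectors, hyperplanes):
    """Split one coarse cluster into groups of texts sharing a signature."""
    buckets = defaultdict(list)
    for index, vector in enumerate(vectors):
        buckets[lsh_signature(vector, hyperplanes)].append(index)
    return list(buckets.values())
```

In this sketch, text vectors that point in nearly the same direction tend to share a signature and stay together, while dissimilar vectors inside the same coarse cluster are separated. More bits in `n_bits` give a finer split.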


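Source journal

Soft Computing (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 8.10

Self-citation rate: 9.80%

Articles published: 927

Review time: 7.3 months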

About the journal: Soft Computing is dedicated to system solutions based on soft computing techniques. It provides rapid dissemination of important results in soft computing technologies, a fusion of research in evolutionary algorithms and genetic programming, neural science and neural net systems, fuzzy set theory and fuzzy systems, and chaos theory and chaotic systems. Soft Computing encourages the integration of soft computing techniques and tools into both everyday and advanced applications. By linking the ideas and techniques of soft computing with other disciplines, the journal serves as a unifying platform that fosters comparisons, extensions, and new applications. As a result, the journal is an international forum for all scientists and engineers engaged in research and development in this fast-growing field.