Enhancing neural topic modeling for social media text via semantic bag of word clusters and log-domain Sinkhorn transport
Authors: Yi Sun, Junhao Zhao, Haoran Xu, Ronghua Zhang, Changzheng Liu, Limengzi Yuan
Journal: Information Processing & Management, Volume 63, Issue 2, Article 104411 (Impact Factor 6.9; JCR Q1, Computer Science, Information Systems)
Publication date: 2025-09-27
DOI: 10.1016/j.ipm.2025.104411
URL: https://www.sciencedirect.com/science/article/pii/S0306457325003528
Citations: 0
Abstract
Topic modeling has been widely applied to analyze text data from social media platforms. Under this scenario, traditional Neural Topic Models (NTMs) encounter three primary challenges: (1) initial text representation; (2) the long-tail nature of topic distributions in social network texts; (3) approximation of Optimal Transport. Motivated by these challenges, we propose an end-to-end solution spanning from text representation to topic modeling.
First, we propose SBoWC, a novel text representation method that performs dimensionality reduction while absorbing semantic information through base terms, achieved by combining word embeddings with clustering statistics. Subsequently, we propose GSWTM, a Wasserstein-based autoencoder topic model that fits the long-tail topic distribution in social network texts via Gamma priors and innovatively employs log-domain Sinkhorn to approximate Optimal Transport.
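The abstract does not spell out how SBoWC is computed, so the following is only a minimal sketch of the general bag-of-word-clusters idea it builds on: each word is mapped to the nearest cluster of word embeddings, and a document is represented by counts over clusters rather than over the full vocabulary. The toy embeddings, the centroids (assumed to come from an earlier clustering step such as k-means), and the function names are all hypothetical, and the "base terms" and clustering statistics mentioned above are not modeled here.

```python
import math

# Toy 2-D word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "striker": (1.0, 0.0),
    "vote": (0.1, 0.9),
    "senate": (0.0, 1.0),
    "ballot": (0.2, 0.8),
}

# Hypothetical centroids, assumed to come from clustering the embeddings.
CENTROIDS = [(0.9, 0.1), (0.1, 0.9)]


def nearest_cluster(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    dists = [math.dist(vec, c) for c in centroids]
    return dists.index(min(dists))


def bag_of_word_clusters(tokens, embeddings, centroids):
    """Count how many tokens of a document fall into each embedding cluster."""
    counts = [0] * len(centroids)
    for tok in tokens:
        if tok in embeddings:  # out-of-vocabulary tokens are skipped
            counts[nearest_cluster(embeddings[tok], centroids)] += 1
    return counts


doc = "goal striker vote goal unknown".split()
doc_vector = bag_of_word_clusters(doc, EMBEDDINGS, CENTROIDS)
```

In this sketch the document vector has one entry per cluster instead of one per vocabulary word, which is where the dimensionality reduction comes from: semantically close words share a dimension.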
Ablation studies demonstrate the transferability and effectiveness of SBoWC in text representation. GSWTM demonstrates significantly better performance than baselines in TU, $C_V$, and the comprehensive metric TQ across four real social network datasets of varying sizes. The log-domain Sinkhorn approximation exhibits excellent stability, allowing the regularization parameter $\epsilon$ to be reduced to 0.1–0.01, thereby approaching the original Optimal Transport.
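To illustrate why a log-domain formulation tolerates a small regularization parameter, here is a minimal sketch of generic log-domain Sinkhorn iterations for entropic Optimal Transport between two histograms. It is not the GSWTM implementation: the function names, the toy marginals, and the cost matrix are assumptions made for illustration. The updates act on dual potentials through a stable log-sum-exp, so the kernel exp(-C/ε), which can underflow when ε is small relative to the costs, is never formed explicitly.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_log(a, b, cost, eps, n_iters=500):
    """Entropic OT between histograms a and b, iterating on dual potentials."""
    n, m = len(a), len(b)
    f = [0.0] * n  # potential tied to the source marginal a
    g = [0.0] * m  # potential tied to the target marginal b
    for _ in range(n_iters):
        # Update f so that the rows of the plan sum to a.
        f = [eps * (math.log(a[i])
                    - logsumexp([(g[j] - cost[i][j]) / eps for j in range(m)]))
             for i in range(n)]
        # Update g so that the columns of the plan sum to b.
        g = [eps * (math.log(b[j])
                    - logsumexp([(f[i] - cost[i][j]) / eps for i in range(n)]))
             for j in range(m)]
    # Recover the transport plan from the potentials.
    return [[math.exp((f[i] + g[j] - cost[i][j]) / eps) for j in range(m)]
            for i in range(n)]


# Toy problem (hypothetical values): two source bins, two target bins.
a = [0.3, 0.7]
b = [0.6, 0.4]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_log(a, b, cost, eps=0.01)
transport_cost = sum(plan[i][j] * cost[i][j]
                     for i in range(2) for j in range(2))
```

For this toy problem the unregularized optimum moves 0.3 units of mass at unit cost, and with eps=0.01 the entropic transport cost lands very close to that value, consistent with the claim that a small ε approaches the original Optimal Transport.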
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.