Privacy Text Clustering Method Based on Burst Feature of Words

IF 1.5 · Zone 4, Computer Science · Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Concurrency and Computation-Practice & Experience Pub Date : 2025-09-03 DOI:10.1002/cpe.70269

Xia Wu, Zehan Li, Yong Wang, Qing Zhao, Ke Wang, Hangyu Hu

Citations: 0

Abstract

Real-time detection of privacy-relevant events in social media faces two fundamental challenges: (1) cluster instability caused by sparse and noisy text data, which leads to center drift; and (2) poor event discernibility in traditional online clustering methods. These limitations severely impair privacy monitoring in dynamic social media environments. To address them, we propose an edge intelligence-driven framework that integrates three components: adaptive burst word detection using wavelet-based signal analysis; spectral clustering of the identified burst words to establish stable event anchors; and real-time incremental text clustering centered on these fixed anchors. In an evaluation on a dataset of 116 million COVID-19-related tweets, the framework achieves a burst word identification accuracy of 86.28%; a cluster purity of 0.875 (a 37% improvement over the baseline method); a throughput of 3000 tweets per minute; and a 78% reduction in irrelevant content through noise filtering. The key advantages of our approach are that it addresses the persistent cluster drift problem via burst-anchored centers; enables efficient distributed processing via an edge intelligence architecture; provides a practical and scalable solution for real-time social media monitoring; and establishes a new paradigm for privacy-aware event detection systems.
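The anchoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the anchor sets, the word-overlap score, the `min_overlap` threshold, and all function names are assumptions made for illustration. The abstract states only that anchors come from spectral clustering of burst words and stay fixed during incremental clustering. The `purity` function uses the standard definition of cluster purity (the fraction of clustered items that belong to the majority true class of their cluster).

```python
from collections import Counter

# Hypothetical event anchors: each anchor is a fixed set of burst words.
# In the paper these come from spectral clustering of detected burst words;
# here they are hard-coded purely for illustration.
ANCHORS = {
    "lockdown": {"lockdown", "curfew", "quarantine"},
    "vaccine": {"vaccine", "dose", "booster"},
}


def assign(text, anchors=ANCHORS, min_overlap=1):
    """Return the anchor sharing the most words with the text, or None for noise."""
    tokens = set(text.lower().split())
    best_label, best_score = None, 0
    for label, words in anchors.items():
        score = len(tokens & words)  # assumed similarity: raw word overlap
        if score > best_score:       # ties keep the earlier anchor
            best_label, best_score = label, score
    return best_label if best_score >= min_overlap else None


def cluster_stream(texts, anchors=ANCHORS):
    """Assign texts one at a time; anchors are never updated, so centers cannot drift."""
    clusters = {label: [] for label in anchors}
    noise = []  # indices of texts that match no anchor (filtered out)
    for i, text in enumerate(texts):
        label = assign(text, anchors)
        if label is None:
            noise.append(i)
        else:
            clusters[label].append(i)
    return clusters, noise


def purity(clusters, truth):
    """Purity = (1/N) * sum over clusters of the majority true-class count."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(truth[i] for i in members).most_common(1)[0][1]
        for members in clusters.values()
        if members
    )
    return majority / total
```

Because each text is compared only against the fixed anchors, the cost per incoming text is independent of how many texts have already been clustered, which is one plausible reason a fixed-anchor design suits streaming input.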

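Source journal

Concurrency and Computation-Practice & Experience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 5.00

Self-citation rate: 10.00%

Articles published: 664

Review time: 9.6 months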

About the journal: Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of: Parallel and distributed computing; High-performance computing; Computational and data science; Artificial intelligence and machine learning; Big data applications, algorithms, and systems; Network science; Ontologies and semantics; Security and privacy; Cloud/edge/fog computing; Green computing; and Quantum computing.