Asymmetric Short-Text Clustering via Prompt

Impact factor: 2 | Zone 4 (Computer Science) | JCR Q3: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

New Generation Computing | Pub Date: 2024-02-19 | DOI: 10.1007/s00354-024-00244-7

Zhi Wang, Yi Zhu, Yun Li, Jipeng Qiang, Yunhao Yuan, Chaowei Zhang


Citations: 0

Abstract



Short-text clustering has attracted much attention with the rapid development of social media in recent decades, yet it remains highly challenging because short texts are sparse in features, highly ambiguous, and massive in quantity. Recently, methods based on pre-trained language models (PLMs) have achieved fairly good results on this task. However, two main problems remain unresolved: (1) the significant gap between the objective forms of pre-training and fine-tuning restricts taking full advantage of the knowledge in PLMs; (2) most existing methods require a post-processing operation for clustering label learning, potentially leading to label estimation errors for different data distributions. To address these problems, we propose Asymmetric Short-Text Clustering via Prompt (ASTCP), with which the learned features are denser and more compact for clustering. Specifically, a subset of the texts in the corpus is first selected by an asymmetric prompt-tuning network, which aims to obtain predicted labels that serve as clustering centers. Then, by propagating the predicted-label information, a fine-tuned model is designed for representation learning. Finally, a clustering module, such as K-means, is built on top of these representations to directly output clustering labels. Extensive experiments on three datasets demonstrate that ASTCP significantly and consistently outperforms other state-of-the-art (SOTA) clustering methods. The source code is available at https://github.com/zhuyi_yzu/ASTCP.
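
The final stage mentioned in the abstract, a clustering module such as K-means running on top of learned representations, can be illustrated with a small sketch. This is not the authors' implementation: the 2-D vectors below are hypothetical stand-ins for representations produced by a fine-tuned model, and the `kmeans` helper is a plain Lloyd's-algorithm routine written only for this example.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means over a list of equal-length float vectors.

    Returns (labels, centroids). Illustrative only; not the ASTCP code.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy "representations": two well-separated groups.
embeddings = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # group A
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # group B
]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is that the first three vectors share one label and the last three share the other. In a real pipeline the toy vectors would be replaced by sentence representations from the model, and a library implementation of K-means would normally be preferred over a hand-written one.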


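Source journal

New Generation Computing (Engineering & Technology; Computer Science: Theory & Methods)

CiteScore: 5.90

Self-citation rate: 15.40%

Articles published:

Review time: >12 weeks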

About the journal: The journal is specially intended to support the development of new computational and cognitive paradigms stemming from the cross-fertilization of various research fields. These fields include, but are not limited to, programming (logic, constraint, functional, object-oriented), distributed/parallel computing, knowledge-based systems, agent-oriented systems, and cognitive aspects of human embodied knowledge. It also encourages theoretical and/or practical papers concerning all types of learning, knowledge discovery, evolutionary mechanisms, human cognition and learning, and emergent systems that can lead to key technologies enabling us to build more complex and intelligent systems. The editorial board hopes that New Generation Computing will work as a catalyst among active researchers with broad interests by ensuring a smooth publication process.