A lightweight mixup-based short texts clustering for contrastive learning

IF 2.1, Zone 4 (Medicine), JCR Q2, MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in Computational Neuroscience Pub Date : 2023-12-18 DOI:10.3389/fncom.2023.1334748

Qiang Xu, HaiBo Zan, ShengWei Ji


Citations: 0

Abstract

Traditional distance-based text clustering struggles to distinguish overlapping representations in medical data. Incorporating contrastive learning optimizes the feature space, and applying mixup implicitly during the data augmentation phase reduces the computational burden. Medical case text is prevalent in everyday life, and clustering is a fundamental method for identifying major categories of conditions within vast amounts of unlabeled text. Learning meaningful clustering scores from data on rare diseases is difficult because of their inherent sparsity. To address this issue, we propose a contrastive clustering method based on mixup, in which a small batch of data is selected to simulate the experimental setting of rare diseases. The contrastive learning module optimizes the feature space based on the fact that positive pairs share negative samples, and clustering groups data with comparable semantic features. The module mitigates the overlap in the data, while mixup generates low-cost virtual features, yielding superior experimental scores even with small batches and reducing resource usage and time overhead. The proposed technique achieves state-of-the-art results and is a promising strategy for unsupervised text clustering.
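The abstract names two ingredients, mixup applied to features and a contrastive objective in which the other samples in a batch act as shared negatives. The sketch below illustrates that general idea only. It is not the authors' implementation: the Beta-distributed mixing coefficient, the InfoNCE-style loss, the temperature value, and the function names `mixup_features` and `info_nce_loss` are all assumptions, since the abstract specifies none of them.

```python
import numpy as np

def mixup_features(features, alpha=0.4, rng=None):
    """Build virtual features by convexly mixing each row with a random partner row.

    Assumption: the mixing weight is drawn from Beta(alpha, alpha), as in
    standard mixup; the abstract does not state how it is chosen.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep each virtual feature closer to its source row
    perm = rng.permutation(len(features))
    return lam * features + (1.0 - lam) * features[perm], lam

def info_nce_loss(anchors, positives, temperature=0.5):
    """InfoNCE-style loss: row i of `positives` is the positive for row i of
    `anchors`; every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (n, n) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # diagonal = positive pairs

# Hypothetical small batch standing in for encoded short texts.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
virtual, lam = mixup_features(features, rng=rng)
loss = info_nce_loss(features, virtual)
```

Because the virtual features are produced by a single weighted sum over existing rows, no second pass through a text encoder is needed to obtain the positive view, which is one plausible reading of the "lightweight" claim.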


Source journal

Frontiers in Computational Neuroscience MATHEMATICAL & COMPUTATIONAL BIOLOGY-NEUROSCIENCES

CiteScore

5.30

Self-citation rate

3.10%

Articles published

166

Review time

6-12 weeks

Journal description: Frontiers in Computational Neuroscience is a first-tier electronic journal devoted to promoting theoretical modeling of brain function and fostering interdisciplinary interactions between theoretical and experimental neuroscience. Progress in understanding the amazing capabilities of the brain is still limited, and we believe that it will only come with deep theoretical thinking and mutually stimulating cooperation between different disciplines and approaches. We therefore invite original contributions on a wide range of topics that present the fruits of such cooperation, or provide stimuli for future alliances. We aim to provide an interactive forum for cutting-edge theoretical studies of the nervous system, and for promulgating the best theoretical research to the broader neuroscience community. Models of all styles and at all levels are welcome, from biophysically motivated realistic simulations of neurons and synapses to high-level abstract models of inference and decision making. While the journal is primarily focused on theoretically based and driven research, we welcome experimental studies that validate and test theoretical conclusions. Also: comp neuro