DCLCSE: Dynamic Curriculum Learning Based Contrastive Learning of Sentence Embeddings

IF 7.5 | Zone 3 (Computer Science) | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data | Pub Date: 2024-07-05 | DOI: 10.1109/TBDATA.2024.3423650

Chang Liu;Dacao Zhang;Meng Wang

Volume 11, Issue 2, pp. 635-647. Full text: https://ieeexplore.ieee.org/document/10587121/

Citations: 0

Abstract

Recently, Contrastive Learning (CL) has made impressive progress in natural language processing, especially in sentence representation learning. Many data augmentation methods have been proposed for generating positive samples. However, because natural language is highly abstract, these augmentations cannot guarantee the quality of the generated positive samples, which may be too easy or too hard. To this end, we propose to improve the quality of positive examples from a data arrangement perspective and develop a novel model-agnostic approach for sentence embeddings: the Dynamic Curriculum Learning based Contrastive Sentence Embedding framework (DCLCSE). Specifically, we incorporate a curriculum learning strategy to control how positive examples are used. At the early learning stage, easy samples are selected to optimize the CL-based model. As the model's capability increases, we gradually select harder samples for training, which keeps learning efficient. Furthermore, we design a novel difficulty measurement module that calculates the difficulty of generated positives while taking the model's current capability into account, so that sample difficulty is measured accurately. Based on the learned difficulties, we develop multiple arrangement strategies to facilitate the model's learning process. Finally, extensive experiments over multiple representative models demonstrate the superiority of DCLCSE. As a byproduct, we have released the code to assist other researchers.
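The easy-to-hard selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the difficulty measure or the arrangement strategies, so the cosine-based `difficulty`, the linear `pacing` schedule, and the `start` fraction below are all assumptions chosen for illustration. Embeddings are assumed to be non-zero vectors produced by the current encoder, so recomputing difficulties during training loosely reflects the model's changing capability.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def difficulty(anchor, positive):
    """Assumed difficulty score: positives far from their anchor count as harder."""
    return 1.0 - cosine(anchor, positive)


def pacing(step, total_steps, start=0.3):
    """Assumed linear pacing: fraction of the difficulty-sorted pool usable at `step`."""
    progress = min(1.0, step / total_steps)
    return min(1.0, start + (1.0 - start) * progress)


def select_positives(pairs, step, total_steps):
    """Return the easiest (anchor, positive) pairs allowed at this training step."""
    ranked = sorted(pairs, key=lambda pair: difficulty(pair[0], pair[1]))
    keep = max(1, math.ceil(pacing(step, total_steps) * len(ranked)))
    return ranked[:keep]
```

Calling `select_positives(pairs, step, total_steps)` once per epoch, with `pairs` re-embedded by the current encoder, would start training on the easiest positives and admit the full pool by the final step. Other pacing functions (for example, root or step schedules) could be swapped in without changing the selection logic.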


Source journal

IEEE Transactions on Big Data

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.