DCLCSE: Dynamic Curriculum Learning Based Contrastive Learning of Sentence Embeddings
Authors: Chang Liu; Dacao Zhang; Meng Wang
Journal: IEEE Transactions on Big Data, vol. 11, no. 2, pp. 635-647
Publication date: 2024-07-05
Publication type: Journal Article
DOI: 10.1109/TBDATA.2024.3423650
URL: https://ieeexplore.ieee.org/document/10587121/
Impact factor: 7.5 (JCR Q1, Computer Science, Information Systems)
Citation count: 0
Abstract
Recently, Contrastive Learning (CL) has made impressive progress in natural language processing, especially in sentence representation learning. Many data augmentation methods have been proposed for generating positive samples. However, because natural language is highly abstract, these augmentations cannot guarantee the quality of the generated positives, which may be too easy or too hard. To this end, we propose to improve the quality of positive examples from a data arrangement perspective and develop a novel model-agnostic approach for sentence embeddings: the Dynamic Curriculum Learning based Contrastive Sentence Embedding framework (DCLCSE). Specifically, we incorporate a curriculum learning strategy to control how positive examples are used. In the early learning stage, easy samples are selected to optimize the CL-based model. As the model's capability increases, we gradually select harder samples for training, which keeps learning efficient. Furthermore, we design a novel difficulty measurement module to calculate the difficulty of generated positives; it takes the model's capability into account so that sample difficulty is measured accurately. Based on the learned difficulties, we develop multiple arrangement strategies to facilitate the model's learning process. Finally, extensive experiments over multiple representative models demonstrate the superiority of DCLCSE. As a byproduct, we have released the code to support other researchers.
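The easy-to-hard selection described in the abstract can be illustrated with a minimal sketch. This is not the DCLCSE implementation: the abstract does not specify the difficulty measure or the arrangement strategies, so everything below is an assumption made for illustration. The difficulty score (one minus cosine similarity between anchor and positive embeddings), the linear pacing function, and the names `difficulty`, `competence`, and `select_positives` are all hypothetical. In a real training loop the embeddings would come from the current model, so the scores would change as the model's capability changes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def difficulty(anchor, positive):
    """Hypothetical difficulty score (an assumption, not the paper's module):
    positives that lie far from their anchor are treated as harder."""
    return 1.0 - cosine(anchor, positive)


def competence(step, total_steps, start=0.25):
    """Hypothetical linear pacing function: the fraction of the
    difficulty-sorted pool that may be used at this training step."""
    return min(1.0, start + (1.0 - start) * step / total_steps)


def select_positives(pairs, step, total_steps):
    """Return the easiest (anchor, positive) pairs allowed at this step."""
    ranked = sorted(pairs, key=lambda p: difficulty(p[0], p[1]))
    k = max(1, math.ceil(competence(step, total_steps) * len(ranked)))
    return ranked[:k]


# Toy 2-D "embeddings"; real ones would be produced by the sentence encoder.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # near-duplicate positive: easy
    ([1.0, 0.0], [0.7, 0.7]),  # moderately different
    ([1.0, 0.0], [0.1, 1.0]),  # far from the anchor: hard
    ([1.0, 0.0], [0.9, 0.3]),
]

early = select_positives(pairs, step=0, total_steps=10)   # easiest quarter only
late = select_positives(pairs, step=10, total_steps=10)   # full pool, easy to hard
```

Under these assumptions, training starts from the easiest quarter of the pool and admits harder positives as the step count grows. A linear schedule is only one possible pacing choice; the paper's own arrangement strategies may differ.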
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.