LATEX-GCL: Large Language Models (LLMs)-Based Data Augmentation for Text-Attributed Graph Contrastive Learning

Haoran Yang, Xiangyu Zhao, Sirui Huang, Qing Li, Guandong Xu
{"title":"LATEX-GCL: Large Language Models (LLMs)-Based Data Augmentation for Text-Attributed Graph Contrastive Learning","authors":"Haoran Yang, Xiangyu Zhao, Sirui Huang, Qing Li, Guandong Xu","doi":"arxiv-2409.01145","DOIUrl":null,"url":null,"abstract":"Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised\ngraph learning that has attracted attention across various application\nscenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet\nto be explored. Because conventional augmentation techniques like feature\nembedding masking cannot directly process textual attributes on TAGs. A naive\nstrategy for applying GCL to TAGs is to encode the textual attributes into\nfeature embeddings via a language model and then feed the embeddings into the\nfollowing GCL module for processing. Such a strategy faces three key\nchallenges: I) failure to avoid information loss, II) semantic loss during the\ntext encoding phase, and III) implicit augmentation constraints that lead to\nuncontrollable and incomprehensible results. In this paper, we propose a novel\nGCL framework named LATEX-GCL to utilize Large Language Models (LLMs) to\nproduce textual augmentations and LLMs' powerful natural language processing\n(NLP) abilities to address the three limitations aforementioned to pave the way\nfor applying GCL to TAG tasks. Extensive experiments on four high-quality TAG\ndatasets illustrate the superiority of the proposed LATEX-GCL method. The\nsource codes and datasets are released to ease the reproducibility, which can\nbe accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.","PeriodicalId":501032,"journal":{"name":"arXiv - CS - Social and Information Networks","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Social and Information Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01145","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Graph Contrastive Learning (GCL) is a potent paradigm for self-supervised graph learning that has attracted attention across various application scenarios. However, GCL for learning on Text-Attributed Graphs (TAGs) has yet to be explored, because conventional augmentation techniques like feature embedding masking cannot directly process textual attributes on TAGs. A naive strategy for applying GCL to TAGs is to encode the textual attributes into feature embeddings via a language model and then feed the embeddings into the following GCL module for processing. Such a strategy faces three key challenges: I) failure to avoid information loss, II) semantic loss during the text encoding phase, and III) implicit augmentation constraints that lead to uncontrollable and incomprehensible results. In this paper, we propose a novel GCL framework named LATEX-GCL that utilizes Large Language Models (LLMs) to produce textual augmentations, leveraging LLMs' powerful natural language processing (NLP) abilities to address the three aforementioned limitations and pave the way for applying GCL to TAG tasks. Extensive experiments on four high-quality TAG datasets illustrate the superiority of the proposed LATEX-GCL method. The source code and datasets are released for reproducibility and can be accessed via this link: https://anonymous.4open.science/r/LATEX-GCL-0712.
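To make the core idea concrete, below is a minimal sketch, assuming a typical contrastive setup: an LLM rewrites each node's textual attribute to form an augmented view, a text encoder maps both views to embeddings, and an InfoNCE loss treats the two views of the same node as a positive pair. The `llm_augment` stub, the toy hashed bag-of-words encoder, and the temperature value are placeholder assumptions, not the paper's actual prompts, encoder, or training objective.

```python
# Sketch of LLM-based textual augmentation for graph contrastive learning,
# following the abstract's description. Placeholders stand in for the real
# LLM call and language-model encoder used in LATEX-GCL.
import torch
import torch.nn.functional as F

def llm_augment(text: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g., a prompt such as
    'Paraphrase this node description: ...'). A real system would query
    an LLM here; this stub only tags the text."""
    return "Paraphrased: " + text

def encode(texts: list[str], dim: int = 64) -> torch.Tensor:
    """Toy hashed bag-of-words encoder standing in for the language-model
    encoder that maps textual attributes to feature embeddings."""
    out = torch.zeros(len(texts), dim)
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            out[i, hash(tok) % dim] += 1.0
    return F.normalize(out, dim=1)  # unit-normalize so dot products are cosines

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """InfoNCE loss: the two views of node i are a positive pair;
    views of all other nodes in the batch act as negatives."""
    logits = z1 @ z2.t() / tau          # pairwise cosine similarities / temperature
    labels = torch.arange(z1.size(0))   # i-th row should match i-th column
    return F.cross_entropy(logits, labels)

node_texts = [
    "A survey paper on graph contrastive learning.",
    "An e-commerce product: wireless noise-cancelling headphones.",
]
view1 = encode(node_texts)                             # original textual attributes
view2 = encode([llm_augment(t) for t in node_texts])   # LLM-augmented attributes
loss = info_nce(view1, view2)
print(f"contrastive loss: {loss.item():.4f}")
```

Because the augmentation happens in text space rather than embedding space, the transformation is explicit and human-readable (one can inspect the rewritten attribute), which is how LLM-based augmentation sidesteps the implicit, uncontrollable constraints of embedding-level perturbations noted in the abstract.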