Enhancing pre-trained language models with Chinese character morphological knowledge

IF 7.4, Zone 1 (Management), JCR Q1: Computer Science, Information Systems

Information Processing & Management Pub Date : 2024-11-06 DOI:10.1016/j.ipm.2024.103945

Zhenzhong Zheng, Xiaoming Wu, Xiangzhi Liu

Volume 62, Issue 1, Article 103945. Full text: https://www.sciencedirect.com/science/article/pii/S0306457324003042

Citations: 0

Abstract

Pre-trained language models (PLMs) have demonstrated success in Chinese natural language processing (NLP) tasks by acquiring high-quality representations through contextual learning. However, these models tend to neglect the glyph features of Chinese characters, which carry valuable semantic knowledge. To address this issue, this paper introduces a self-supervised learning strategy, named SGBERT, that learns high-quality semantic knowledge from Chinese character morphology to enhance PLMs' understanding of natural language. The learning process of SGBERT has two stages. In the first stage, we warm up the glyph encoder with contrastive learning between glyphs, giving it a preliminary glyph-encoding capability. In the second stage, we transform the glyph features captured by the glyph encoder into context-sensitive representations through a glyph-aware window. These representations are then contrasted with the character representations generated by the PLM, so that the PLM's strong representation capability guides glyph learning. Finally, the glyph knowledge is fused with the pre-trained model's representations to obtain semantically richer representations. We conduct experiments on ten datasets covering six Chinese NLP tasks, and the results demonstrate that SGBERT significantly enhances commonly used Chinese PLMs. On average, introducing SGBERT improved performance by 1.36% for BERT and 1.09% for RoBERTa.
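
The abstract does not give the loss function, the glyph-aware window, or the fusion operator, so the following is only a minimal sketch of the general idea behind the second stage, under stated assumptions: an InfoNCE-style contrastive loss (a common choice, not confirmed as the authors' objective) pulls each glyph representation toward the PLM representation of the same character, and a weighted sum stands in for the fusion step. The names `info_nce`, `fuse`, `temperature`, and `alpha` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(glyph_vecs, char_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's).

    glyph_vecs[i] and char_vecs[i] are the positive pair for character i;
    every other char_vecs[j] in the batch acts as a negative.
    """
    total = 0.0
    for i, g in enumerate(glyph_vecs):
        logits = [cosine(g, c) / temperature for c in char_vecs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(glyph_vecs)


def fuse(glyph_vec, char_vec, alpha=0.5):
    """Hypothetical fusion: weighted sum of glyph and PLM representations."""
    return [alpha * g + (1 - alpha) * c for g, c in zip(glyph_vec, char_vec)]


# Toy batch of two characters with 2-dimensional representations.
glyph = [[1.0, 0.0], [0.0, 1.0]]
chars_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each glyph matches its own character
chars_swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives point at the wrong character

print(info_nce(glyph, chars_aligned) < info_nce(glyph, chars_swapped))
print(fuse(glyph[0], chars_aligned[0]))
```

In this toy batch the loss is lower when each glyph vector lies closest to its own character's vector, which is the behaviour a contrastive objective rewards. A real system would compute the same quantity on encoder outputs with a tensor library.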

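Source journal

Information Processing & Management (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 17.00
Self-citation rate: 11.60%
Publication volume: 276
Review time: 39 days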

About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.