BioHanBERT: A Hanzi-aware Pre-trained Language Model for Chinese Biomedical Text Mining

Xiaosu Wang, Yun Xiong, Hao Niu, Jingwen Yue, Yangyong Zhu, Philip S. Yu
{"title":"BioHanBERT: A Hanzi-aware Pre-trained Language Model for Chinese Biomedical Text Mining","authors":"Xiaosu Wang, Yun Xiong, Hao Niu, Jingwen Yue, Yangyong Zhu, Philip S. Yu","doi":"10.1109/ICDM51629.2021.00181","DOIUrl":null,"url":null,"abstract":"Unsupervised pre-trained language models (PLMs) have boosted the development of effective biomedical text mining models. But the biomedical texts contain a huge number of long-tail concepts and terminologies, which makes further pre-training on biomedical corpora relatively expensive (more biomedical corpora and more pre-training steps are needed). Nonetheless, this problem receives less attention in recent studies. In Chinese biomedical text, concepts and terminologies consist of Chinese characters, and Chinese characters are often composed of sub-character components which are also semantically informative; thus in order to enhance the semantics of biomedical concepts and terminologies, the use of a Chinese character’s component-level internal semantic information also appears to be reasonable.In this paper, we propose a novel hanzi-aware pre-trained language model for Chinese biomedical text mining, referred to as BioHanBERT (hanzi-aware BERT for Chinese biomedical text mining), utilizing the component-level internal semantic information of Chinese characters to enhance the semantics of Chinese biomedical concepts and terminologies, and thereby to reduce further pre-training costs. BioHanBERT first employs a Chinese character encoder to extract the component-level internal semantic feature of each Chinese character, and then fuse the character’s internal semantic feature and its contextual embedding extracted by BERT to enrich the representations of the concepts or terminologies containing the character. The results of extensive experiments show that our model is able to consistently outperform current state-of-the-art (SOTA) models in a wide range of Chinese biomedical natural language processing (NLP) tasks.","PeriodicalId":320970,"journal":{"name":"2021 IEEE International Conference on Data Mining (ICDM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM51629.2021.00181","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Unsupervised pre-trained language models (PLMs) have boosted the development of effective biomedical text mining models. However, biomedical texts contain a huge number of long-tail concepts and terminologies, which makes further pre-training on biomedical corpora relatively expensive: more biomedical corpora and more pre-training steps are needed. Nonetheless, this problem has received little attention in recent studies. In Chinese biomedical text, concepts and terminologies consist of Chinese characters, and Chinese characters are often composed of sub-character components that are themselves semantically informative; using a Chinese character's component-level internal semantic information to enrich the semantics of biomedical concepts and terminologies therefore also appears reasonable. In this paper, we propose a novel hanzi-aware pre-trained language model for Chinese biomedical text mining, referred to as BioHanBERT (hanzi-aware BERT for Chinese biomedical text mining), which exploits the component-level internal semantic information of Chinese characters to enhance the semantics of Chinese biomedical concepts and terminologies and thereby reduce further pre-training costs. BioHanBERT first employs a Chinese character encoder to extract the component-level internal semantic feature of each Chinese character, and then fuses each character's internal semantic feature with its contextual embedding extracted by BERT to enrich the representations of the concepts or terminologies containing that character. The results of extensive experiments show that our model consistently outperforms current state-of-the-art (SOTA) models on a wide range of Chinese biomedical natural language processing (NLP) tasks.
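
The abstract describes two ingredients: a character encoder that builds a feature from each character's sub-character components, and a fusion of that feature with the character's BERT contextual embedding. The sketch below illustrates that idea only; the component inventory, the encoder architecture (here a simple mean-pooled component embedding), and the fusion function (here a learned gate) are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the component-feature + contextual-embedding fusion idea.
# The component IDs, mean-pooling encoder, and gated fusion are assumptions;
# the paper's actual design may differ.
import torch
import torch.nn as nn


class ComponentCharEncoder(nn.Module):
    """Encodes a character from the IDs of its sub-character components (assumed input)."""

    def __init__(self, num_components: int, dim: int):
        super().__init__()
        self.comp_emb = nn.Embedding(num_components, dim, padding_idx=0)

    def forward(self, comp_ids: torch.Tensor) -> torch.Tensor:
        # comp_ids: (batch, seq_len, max_components), 0 = padding
        emb = self.comp_emb(comp_ids)                     # (B, L, C, D)
        mask = (comp_ids != 0).unsqueeze(-1).float()      # ignore padded components
        summed = (emb * mask).sum(dim=2)
        count = mask.sum(dim=2).clamp(min=1.0)
        return summed / count                             # mean-pooled component feature (B, L, D)


class GatedFusion(nn.Module):
    """Fuses a character's contextual embedding with its component-level feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, contextual: torch.Tensor, internal: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([contextual, internal], dim=-1)))
        return g * contextual + (1 - g) * internal        # (B, L, D)


if __name__ == "__main__":
    B, L, C, D = 2, 8, 4, 768
    contextual = torch.randn(B, L, D)                     # stand-in for BERT output
    comp_ids = torch.randint(0, 50, (B, L, C))            # stand-in component IDs
    fused = GatedFusion(D)(contextual, ComponentCharEncoder(50, D)(comp_ids))
    print(fused.shape)                                    # torch.Size([2, 8, 768])
```

In practice the contextual tensor would come from a BERT encoder over the character sequence, and the fused representations would feed the downstream task heads.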