BERTCWS: unsupervised multi-granular Chinese word segmentation based on a BERT method for the geoscience domain

Impact factor: 2.7; JCR Q1 (Geography)
Qinjun Qiu, Zhong Xie, K. Ma, Miao Tian
Journal: Annals of GIS, pages 387-399
Published: 2023-03-02
DOI: 10.1080/19475683.2023.2186487
URL: https://doi.org/10.1080/19475683.2023.2186487

Abstract

Unlike alphabetic languages such as English, written Chinese has no explicit word boundaries. Word segmentation is therefore a fundamental step in Chinese text processing, information retrieval, and knowledge discovery. In the geoscience domain, most existing Chinese word segmentation tools and models require a prespecified dictionary and a large, relevant training corpus, and their segmentation accuracy drops significantly on out-of-domain text. To address this issue, a purely unsupervised and generic two-stage architecture (named BERTCWS) for domain-specific Chinese word segmentation is proposed. We first design an incidence matrix termed the 'character combination tightness' to calculate the closeness between characters. Then, BERTCWS recognizes geoscience terms using a segmenter based on Bidirectional Encoder Representations from Transformers (BERT), and multi-granular segmentation is generated by setting different thresholds. Finally, a discriminator is constructed to validate the correctness of the segmented words. Our numerical study demonstrates that BERTCWS can identify both general-domain and geoscience-domain terms. Additionally, multi-granular segmentation can offer a set of candidate geoscience terms of various lengths.
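
The threshold step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the tightness formula or the cutting rule, so the sketch assumes one tightness score per pair of adjacent characters and assumes a word boundary is placed wherever that score falls below the threshold. The scores are hand-picked for illustration (in the paper they are derived from BERT), and the function names `segment_by_threshold` and `multi_granular` are hypothetical.

```python
from typing import Dict, List, Sequence


def segment_by_threshold(text: str, tightness: Sequence[float],
                         threshold: float) -> List[str]:
    """Cut between characters i and i+1 whenever tightness[i] < threshold.

    ASSUMPTION: the cutting rule and the score layout (one score per
    adjacent character pair) are illustrative, not taken from the paper.
    """
    if len(tightness) != max(len(text) - 1, 0):
        raise ValueError("need one tightness score per adjacent character pair")
    words: List[str] = []
    start = 0
    for i, score in enumerate(tightness):
        if score < threshold:
            # Weak link between text[i] and text[i + 1]: close the current word.
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])
    return words


def multi_granular(text: str, tightness: Sequence[float],
                   thresholds: Sequence[float]) -> Dict[float, List[str]]:
    """Run the same cut at several thresholds to get coarse-to-fine output."""
    return {t: segment_by_threshold(text, tightness, t) for t in thresholds}


# Hand-picked toy scores for the five adjacent pairs in a six-character phrase.
text = "花岗岩侵入体"  # "granite intrusive body"
tightness = [0.9, 0.8, 0.4, 0.85, 0.7]

for threshold, words in multi_granular(text, tightness, [0.3, 0.5, 0.75]).items():
    print(threshold, words)
```

Under these assumptions, a low threshold keeps the whole phrase as one candidate term, while higher thresholds split it into progressively shorter units, which is one way to read "multi-granular" in the abstract.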
Source journal: Annals of GIS (CiteScore: 8.30; self-citation rate: 2.00%; articles published: 31)