Enhancing CRF-based Chinese Word Segmentation Using a Rapid and Effective Feature Template Selection Algorithm and Character Normalization

Yulin Ren, Dehua Li
2018 International Computers, Signals and Systems Conference (ICOMSSC), published 2018-09-01
DOI: 10.1109/icomssc45026.2018.8941808 (https://doi.org/10.1109/icomssc45026.2018.8941808)

Abstract

Conditional random fields (CRFs) are among the classic models for Chinese word segmentation (CWS). Deep neural networks (DNNs) have recently become a research hotspot in natural language processing (NLP), but studies applying DNNs to CWS have not yielded significant gains over CRF models, so improving CRFs for CWS remains a viable avenue for research. This paper proposes two methods to enhance CRF-based CWS. First, a rapid and effective sequential forward selection (SFS)-style method is used for feature template selection, balancing search performance against search speed. Second, a character normalization method that is more robust than the traditional method is described. Incremental evaluations on the second SIGHAN bakeoff show that the two proposed methods reduce the error by 7.8% and 10.6%, respectively, in terms of F-score. The final system achieves F-scores of 0.955 (AS), 0.955 (CITYU), 0.970 (MSR), and 0.952 (PKU), which are comparable to those of the best systems reported in the literature.
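
The abstract does not spell out the authors' selection procedure, so the following is only a minimal sketch of plain sequential forward selection, not the paper's faster SFS-style variant. The function name, the `score` callback, the `min_gain` threshold, and the template names (`C0`, `C-1C0`, and so on) are all hypothetical. In a real CWS setting, `score` would presumably train a CRF with the given templates and return an F-score on held-out data; here a toy scoring function stands in for that step.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sequential_forward_selection(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy SFS: repeatedly add the template that most improves the score.

    `score` is a hypothetical callback supplied by the caller; it is assumed
    to return a higher value for a better template set.
    """
    remaining = list(candidates)
    selected: List[str] = []
    best_score = float("-inf")

    while remaining:
        # Evaluate every remaining template when added to the current set.
        trials = [(score(frozenset(selected + [t])), t) for t in remaining]
        trial_score, trial_template = max(trials, key=lambda pair: pair[0])

        # Stop once the best possible addition no longer improves the score.
        if selected and trial_score - best_score <= min_gain:
            break

        selected.append(trial_template)
        remaining.remove(trial_template)
        best_score = trial_score

    return selected, best_score


def toy_score(templates: FrozenSet[str]) -> float:
    """Stand-in for 'train a CRF and measure F-score' (illustrative values only)."""
    useful = {"C0": 0.90, "C-1C0": 0.04, "C0C1": 0.02}
    # Templates not listed above slightly hurt the toy score.
    return sum(useful.get(t, -0.01) for t in templates)


chosen, final_score = sequential_forward_selection(
    ["C-1", "C0", "C-1C0", "C0C1", "C2"], toy_score
)
```

Each round costs one evaluation per remaining template, which is why a full SFS over CRF training runs is expensive and why a faster variant, as the abstract suggests, would be attractive.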