Enhancing CRF-based Chinese Word Segmentation Using a Rapid and Effective Feature Template Selection Algorithm and Character Normalization
Yulin Ren, Dehua Li
2018 International Computers, Signals and Systems Conference (ICOMSSC), 2018-09-01
DOI: https://doi.org/10.1109/icomssc45026.2018.8941808
Abstract
Conditional random fields (CRFs) are among the classic models for Chinese word segmentation (CWS). Deep neural networks (DNNs) have recently become a research hotspot in natural language processing (NLP). However, studies applying DNNs to CWS have not yielded significant gains over CRF models, so improving CRFs for CWS remains a viable avenue for research. This paper proposes two methods to enhance CRF-based CWS. First, a rapid and effective sequential forward selection (SFS)-style method is used for feature template selection, balancing search performance against search speed. Second, a character normalization method that is more robust than the traditional method is described. Incremental evaluations on the second SIGHAN bakeoff show that the two proposed methods reduce the error by 7.8% and 10.6%, respectively, in terms of F-score. The final system achieved F-scores of 0.955 (AS), 0.955 (CITYU), 0.970 (MSR), and 0.952 (PKU), which are comparable to those of the best systems reported in the literature.
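To make the idea of SFS-style feature template selection concrete, the following is a minimal sketch of plain greedy sequential forward selection. It is not the authors' algorithm: the abstract does not specify how their variant is made rapid, so this shows only the textbook baseline. The template names (`C0`, `C-1C0`, and so on), the `TOY_GAIN` table, and the `toy_score` function are hypothetical stand-ins. In a real setting, the score function would presumably train a CRF with the given templates and return its F-score on held-out data.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def sequential_forward_selection(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy SFS: repeatedly add the single candidate that raises the score most.

    Stops when no remaining candidate improves the score by more than min_gain.
    """
    remaining = list(candidates)
    selected: List[str] = []
    best_score = score(frozenset())
    while remaining:
        # Score every one-template extension of the current selection.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        trial_score, trial_best = max(trials, key=lambda t: t[0])
        if trial_score - best_score <= min_gain:
            break  # no candidate helps enough; stop searching
        selected.append(trial_best)
        remaining.remove(trial_best)
        best_score = trial_score
    return selected, best_score


# Hypothetical, additive toy gains (arbitrary units), used only for illustration.
# A real score would come from training and evaluating a CRF segmenter.
TOY_GAIN = {"C0": 50.0, "C-1C0": 30.0, "C0C1": 10.0, "C-2": -5.0}


def toy_score(templates: FrozenSet[str]) -> float:
    """Stand-in for 'train a CRF with these templates and measure F-score'."""
    return sum(TOY_GAIN[t] for t in templates)


if __name__ == "__main__":
    chosen, final_score = sequential_forward_selection(TOY_GAIN, toy_score)
    print(chosen, final_score)
```

Each round of this baseline evaluates every remaining candidate, so the number of score evaluations grows quadratically with the number of candidate templates. When each evaluation means training a CRF, that cost is a plausible motivation for a faster SFS-style variant such as the one the abstract mentions.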