Feature Abstraction for Lightweight and Accurate Chinese Word Segmentation
Le Tian, Xipeng Qiu, Xuanjing Huang
2013 International Conference on Asian Language Processing, published 2013-08-17
DOI: 10.1109/IALP.2013.65 (https://doi.org/10.1109/IALP.2013.65)
Abstract
Chinese word segmentation (CWS) is an important and necessary step in analyzing Chinese texts. State-of-the-art CWS systems are mostly based on sequence labeling algorithms and use discriminative models with millions of overlapping binary features. However, little work has addressed porting these systems to devices with limited computing capacity and memory. In this paper, we focus on two challenges in Chinese word segmentation: (1) low accuracy on out-of-vocabulary words and (2) a huge feature space. To address these two problems, we propose a method that abstracts the original input at both the character and feature levels. We group "similar" features to generate a more abstract representation. Experimental results show that feature abstraction can greatly reduce the feature space while maintaining comparable performance.
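
To make the idea of abstraction concrete, the following is a minimal sketch and not the authors' method. It assumes a hypothetical character-level scheme in which digits, Latin letters, and punctuation are each collapsed into one coarse class, and it uses common unigram and bigram feature templates over a three-character window. The abstract does not specify how "similar" features are grouped, so `char_class`, `extract_features`, and the toy sentences are illustrative assumptions only.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Map a character to a coarse class (hypothetical scheme, not the paper's)."""
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    if unicodedata.category(ch).startswith("P"):
        return "PUNCT"
    return ch  # other characters (e.g. Han) are kept as they are


def extract_features(chars, i):
    """Overlapping binary features for position i: unigrams and bigrams in a 3-character window."""
    def at(k):
        j = i + k
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    return [
        f"U-1={at(-1)}",
        f"U0={at(0)}",
        f"U+1={at(1)}",
        f"B-1={at(-1)}|{at(0)}",
        f"B0={at(0)}|{at(1)}",
    ]


def feature_space(sentences, abstract):
    """Collect the set of distinct features seen over all positions of all sentences."""
    seen = set()
    for sentence in sentences:
        chars = [char_class(c) for c in sentence] if abstract else list(sentence)
        for i in range(len(chars)):
            seen.update(extract_features(chars, i))
    return seen


# Toy data, invented for illustration only.
sentences = ["他2013年到北京", "她1998年去上海", "我在ABC公司"]
raw = feature_space(sentences, abstract=False)
abstracted = feature_space(sentences, abstract=True)
print(len(raw), len(abstracted))  # the abstracted feature set is smaller

# A digit never seen in the toy data still maps onto known abstract features.
unseen = [char_class(c) for c in "他2024年"]
print(set(extract_features(unseen, 4)) <= abstracted)
```

In this toy setting, collapsing digits and Latin letters shrinks the set of distinct features, and a character that never appeared in the training sentences (here the digit 4) still fires only features that were already observed. This mirrors, in a very simplified form, the two goals named in the abstract: a smaller feature space and better handling of out-of-vocabulary words.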