Chinese Word Segmentation Based on Maximum Entropy
Authors: Xiaolin Li, Zerong Hu, Tao Lu
Published in: Proceedings of the 2019 International Conference on Robotics Systems and Vehicle Technology - RSVT '19
Publication date: 2019-10-16
DOI: 10.1145/3366715.3366741 (https://doi.org/10.1145/3366715.3366741)
Citations: 0
Abstract
Chinese word segmentation has received extensive attention in recent years. Character-based tagging, which recasts word segmentation as a sequence labeling problem, has greatly improved segmentation performance and has become the dominant approach. To study the performance of this approach further, this paper uses a maximum entropy sequence labeling model. We compare experimental results across two word-position tag sets and three feature templates. We also examine unknown words and segmentation ambiguity in the segmentation results. First, we combine N-grams with cohesion and degree-of-freedom measures to address unknown words. Then the maximum entropy model is trained on the resulting new segmentation to eliminate ambiguity. Closed tests were conducted on the corpus of the Bakeoff 2005 international Chinese word segmentation evaluation. Experiments show that the six-tag set, combined with its corresponding feature templates, achieves better segmentation performance. After adding unknown-word and disambiguation processing, segmentation performance on some data sets improves further, reaching the best results of Bakeoff 2005.
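The reduction from segmentation to sequence labeling can be illustrated with a minimal sketch. The abstract does not list the tags in its two word-position sets, so the sketch assumes the commonly used four-tag set (B, M, E, S) and six-tag set (B, B2, B3, M, E, S). The function names are hypothetical, and the maximum entropy model that would predict the tags is not shown.

```python
from typing import List, Tuple

# Assumed tag inventories; the abstract does not name the actual tags.
FOUR_TAGS = ("B", "M", "E", "S")
SIX_TAGS = ("B", "B2", "B3", "M", "E", "S")


def word_to_tags(word: str, six_tag: bool = False) -> List[str]:
    """Label each character of one word with its position in the word."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]  # single-character word
    if not six_tag:
        return ["B"] + ["M"] * (n - 2) + ["E"]
    # Six-tag variant: the second and third characters get their own
    # tags (B2, B3) unless they are the last character of the word.
    inner = ["B2", "B3"][: n - 2]
    inner += ["M"] * (n - 2 - len(inner))
    return ["B"] + inner + ["E"]


def words_to_tags(words: List[str], six_tag: bool = False) -> Tuple[List[str], List[str]]:
    """Turn a segmented sentence into parallel character and tag sequences."""
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        chars.extend(word)
        tags.extend(word_to_tags(word, six_tag))
    return chars, tags


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Recover words from a tag sequence: E and S close the current word."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words
```

With this encoding, training data is produced by `words_to_tags` from a segmented corpus, and a predicted tag sequence is decoded back into words by `tags_to_words`. The same decoder works for both tag sets because only E and S mark a word boundary.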