A Comparison of Several Word Clustering Models
Lichi Yuan
2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), published 2019-12-01
DOI: https://doi.org/10.1109/IAEAC47372.2019.8997887
Citations: 1
Abstract
Data sparsity is a major factor limiting the performance of statistical language models, and language models based on word classes are an effective way to mitigate it. This paper defines word similarity using the mutual information of adjacent words, defines word-set similarity in terms of word similarity, and proposes a bottom-up hierarchical word clustering algorithm that can reach a global optimum. Experimental results show that the clustering algorithm runs quickly and achieves good clustering performance. We then interpolated the class-based models with the word-based models and found that the interpolation mitigates the remaining sparse-data problems of statistical language models.
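The interpolation step described in the abstract can be illustrated with a minimal Python sketch. It assumes bigram models, unsmoothed maximum-likelihood estimates, a fixed interpolation weight `lam`, and a hand-written word-to-class map `word2class`; none of these details are given in the abstract, and the class map stands in for the output of a clustering algorithm rather than reproducing the paper's method. The class-based probability uses the standard factorization P(w | w') = P(c(w) | c(w')) * P(w | c(w)).

```python
from collections import Counter


def train_counts(tokens, word2class):
    """Collect the counts needed by the word-based and class-based bigram models."""
    classes = [word2class[w] for w in tokens]
    return {
        "word_uni": Counter(tokens),                   # count(w)
        "word_ctx": Counter(tokens[:-1]),              # count(w') as a bigram context
        "word_bi": Counter(zip(tokens, tokens[1:])),   # count(w', w)
        "class_uni": Counter(classes),                 # count(c)
        "class_ctx": Counter(classes[:-1]),            # count(c') as a bigram context
        "class_bi": Counter(zip(classes, classes[1:])),  # count(c', c)
    }


def p_word(counts, prev, word):
    """Word-based bigram estimate P(word | prev); 0.0 for unseen contexts."""
    ctx = counts["word_ctx"][prev]
    return counts["word_bi"][(prev, word)] / ctx if ctx else 0.0


def p_class(counts, word2class, prev, word):
    """Class-based bigram estimate P(c(word) | c(prev)) * P(word | c(word))."""
    c_prev, c_word = word2class[prev], word2class[word]
    ctx = counts["class_ctx"][c_prev]
    if ctx == 0:
        return 0.0
    transition = counts["class_bi"][(c_prev, c_word)] / ctx
    emission = counts["word_uni"][word] / counts["class_uni"][c_word]
    return transition * emission


def p_interpolated(counts, word2class, prev, word, lam=0.5):
    """Linear interpolation of the two models; lam is an assumed, untuned weight."""
    return (lam * p_word(counts, prev, word)
            + (1.0 - lam) * p_class(counts, word2class, prev, word))


# Toy data (hypothetical): the class map is written by hand for illustration.
tokens = "the cat sat the dog ran".split()
word2class = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB"}
counts = train_counts(tokens, word2class)

# "cat ran" never occurs in the toy corpus, so the word-based model gives it 0,
# while the class-based model still assigns mass through NOUN -> VERB.
print(p_word(counts, "cat", "ran"))                      # 0.0
print(p_class(counts, word2class, "cat", "ran"))         # 0.5
print(p_interpolated(counts, word2class, "cat", "ran"))  # 0.25
```

In this toy corpus the unseen bigram "cat ran" receives a nonzero interpolated probability because "cat" and "dog" share a class, which is the sparse-data effect the abstract refers to. In practice the weight `lam` would be tuned on held-out data rather than fixed.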