Feature extension for Chinese short text classification based on LDA and Word2vec
Authors: Fanke Sun, Heping Chen
Published in: 2018 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), 2018-05-01
DOI: 10.1109/ICIEA.2018.8397890 (https://doi.org/10.1109/ICIEA.2018.8397890)
Citations: 10
Abstract
Short texts are sparse, so traditional text classification methods struggle to achieve good results on them. In this paper, we propose a short text classification method based on word vectors and the LDA topic model that takes two factors into account: the Grammatical Category-combined Weight and the Topic High-frequency Word. In this method, Gibbs sampling is used to train the LDA topic model on the basis of part-of-speech weights. Word2vec is then used to train word vectors on the training results, and the Topic High-frequency Words are vectorized. The test text is then feature-extended. After feature extension, the SVM algorithm is used to classify the extended short texts, and the classification results are evaluated using precision, recall, and F1-score. The results show that this method can significantly improve classification performance.
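The feature-extension step can be illustrated with a minimal sketch. The abstract does not state how topic words are selected for a given short text, so this sketch assumes one plausible rule: a topic high-frequency word is appended when its cosine similarity to some word already in the text reaches a threshold. The function names, the `threshold` and `max_new` parameters, and the toy three-dimensional vectors are hypothetical and stand in for vectors that Word2vec would produce. This is not the authors' implementation.

```python
import math

# Hypothetical toy word vectors standing in for trained Word2vec output.
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "goal": [0.85, 0.15, 0.05],
    "league": [0.9, 0.2, 0.0],
    "stock": [0.1, 0.9, 0.2],
    "market": [0.0, 0.95, 0.1],
}

# Hypothetical topic high-frequency words, as an LDA model might yield.
TOPIC_WORDS = ["goal", "league", "stock", "market"]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extend_features(tokens, topic_words, vectors, threshold=0.8, max_new=3):
    """Append the topic words most similar to the text (assumed selection rule)."""
    present = set(tokens)
    scored = []
    for candidate in topic_words:
        if candidate in present or candidate not in vectors:
            continue
        # Best similarity between the candidate and any known word in the text.
        best = max(
            (cosine(vectors[candidate], vectors[t]) for t in tokens if t in vectors),
            default=0.0,
        )
        if best >= threshold:
            scored.append((best, candidate))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return list(tokens) + [word for _, word in scored[:max_new]]


extended = extend_features(["football", "match"], TOPIC_WORDS, TOY_VECTORS)
print(extended)
```

In a full pipeline, the extended token lists would then be vectorized and passed to an SVM classifier. The threshold controls the trade-off between adding useful context and adding noise to an already short text.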