{"title":"A word distributed representation based framework for large-scale short text classification","authors":"Di Yao, Jingping Bi, Jianhui Huang, Jin Zhu","doi":"10.1109/IJCNN.2015.7280513","DOIUrl":null,"url":null,"abstract":"With the development of internet, there are billions of short texts generated each day. However, the accuracy of large scale short text classification is poor due to the data sparseness. Traditional methods used to use external dataset to enrich the representation of document and solve the data sparsity problem. But external dataset which matches the specific short texts is hard to find. In this paper, we propose a framework to solve the data sparsity problem without using external dataset. Our framework deal with large scale short text by making the most of semantic similarity of words which learned from the training short texts. First, we learn word distributed representation and measure the word semantic similarity from the training short texts. Then, we propose a method which enrich the document representation by using the word semantic similarity information. At last, we build classifiers based on the enriched representation. We evaluate our framework on both the benchmark dataset(Standford Sentiment Treebank) and the large scale Chinese news title dataset which collected by ourselves. For the benchmark dataset, using our framework can improve 3% classification accuracy. 
The result we tested on the large scale Chinese news title dataset shows that our framework achieve better result with the increase of the training set size.","PeriodicalId":6539,"journal":{"name":"2015 International Joint Conference on Neural Networks (IJCNN)","volume":"28 1","pages":"1-7"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN.2015.7280513","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16
Abstract
With the development of the internet, billions of short texts are generated each day. However, the accuracy of large-scale short text classification is poor due to data sparseness. Traditional methods use an external dataset to enrich the document representation and mitigate the sparsity problem, but an external dataset that matches the specific short texts is hard to find. In this paper, we propose a framework that solves the data sparsity problem without using any external dataset. Our framework handles large-scale short texts by making the most of the semantic similarity of words learned from the training short texts. First, we learn distributed word representations and measure word semantic similarity from the training short texts. Then, we propose a method that enriches the document representation using the word semantic similarity information. Finally, we build classifiers on the enriched representation. We evaluate our framework on both a benchmark dataset (the Stanford Sentiment Treebank) and a large-scale Chinese news title dataset that we collected ourselves. On the benchmark dataset, our framework improves classification accuracy by 3%. Results on the large-scale Chinese news title dataset show that our framework achieves better results as the training set size increases.
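The enrichment step described in the abstract can be illustrated with a minimal sketch. The toy word vectors, the cosine-similarity threshold, and the soft-count weighting below are all illustrative assumptions, not the authors' actual method or learned embeddings; in the paper the representations are learned from the training short texts themselves.

```python
import math

# Toy word vectors standing in for learned distributed representations.
# These values are illustrative assumptions for the sketch, not the
# embeddings learned in the paper.
embeddings = {
    "soccer":   [0.90, 0.10, 0.00],
    "football": [0.85, 0.15, 0.05],
    "election": [0.05, 0.90, 0.20],
    "vote":     [0.10, 0.85, 0.25],
    "stock":    [0.00, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def enrich(tokens, vocab=embeddings, threshold=0.9):
    """Expand a short text's bag-of-words with semantically similar words.

    Each in-vocabulary token contributes a count of 1.0; every other word
    whose cosine similarity exceeds the threshold is added with its
    similarity as a soft count, densifying the sparse representation.
    """
    weights = {}
    for tok in tokens:
        if tok not in vocab:
            continue
        weights[tok] = weights.get(tok, 0.0) + 1.0
        for other, vec in vocab.items():
            if other == tok:
                continue
            sim = cosine(vocab[tok], vec)
            if sim >= threshold:
                weights[other] = weights.get(other, 0.0) + sim
    return weights

# A one-word "short text" gains weight on its near neighbour as well.
enriched = enrich(["soccer"])
```

The enriched vector then feeds an ordinary classifier; the point of the sketch is only that the document representation becomes denser without consulting any external corpus.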